Abstract

We consider the problem of learning a non-negative linear classifier with a 1-norm of at most $k$
and a fixed threshold, under the hinge loss. This problem generalizes the problem of learning a
$k$-monotone disjunction. We prove that we can learn efficiently in this setting, at a rate which is
linear in both $k$ and the size of the threshold, and that this is the best possible rate. We provide
an efficient online learning algorithm that achieves the optimal rate, and show that in the batch
case, empirical risk minimization achieves this rate as well. The rates we show are tighter than the
uniform convergence rate, which grows with $k^2$.

Keywords: linear classifiers, monotone disjunctions, online learning, empirical risk minimization,
uniform convergence

1. Introduction

We consider the problem of learning non-negative, low-$\ell_1$-norm linear classifiers with a fixed (or
bounded) threshold. That is, we consider hypothesis classes over instances $x \in [0,1]^d$ of the
following form:

$$\mathcal{H}_{k,\theta} := \left\{ x \mapsto \langle w, x\rangle - \theta \;\middle|\; w \in \mathbb{R}^d_+,\ \|w\|_1 \le k \right\}, \tag{1}$$

where we associate each (real-valued) linear predictor in $\mathcal{H}_{k,\theta}$ with a binary classifier:¹

$$x \mapsto \operatorname{sign}(\langle w, x\rangle - \theta) = \begin{cases} 1 & \text{if } \langle w, x\rangle > \theta, \\ -1 & \text{if } \langle w, x\rangle < \theta. \end{cases} \tag{2}$$



Note that the hypothesis class is specified by both the $\ell_1$-norm constraint $k$ and the fixed threshold
$\theta$. In fact, the main challenge here is to understand how the complexity of learning $\mathcal{H}_{k,\theta}$ changes
with $\theta$.

The classes $\mathcal{H}_{k,\theta}$ can be seen as a generalization and extension of the class of $k$-monotone
disjunctions and $r$-of-$k$ formulas. Considering binary instances $x \in \{0,1\}^d$, the class of $k$-monotone
disjunctions corresponds to linear classifiers with binary weights, $w \in \{0,1\}^d$, with $\|w\|_1 \le k$ and
a fixed threshold of $\theta = \tfrac{1}{2}$. That is, a restriction of $\mathcal{H}_{k,1/2}$ to integer weights and integer instances.
More generally, the class of $r$-of-$k$ formulas (i.e. formulas which are true if at least $r$ of a specified
$k$ variables are true) corresponds to a similar restriction, but with a threshold of $\theta = r - \tfrac{1}{2}$.
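The correspondence between these Boolean formulas and thresholded linear classifiers can be illustrated with a small sketch. The helper names below (`linear_threshold_classifier`, `monotone_disjunction`, `r_of_k_formula`) are hypothetical and chosen only for this illustration; the code assumes binary instances and 0/1 weights as in the paragraph above.

```python
def linear_threshold_classifier(w, theta):
    """Return the classifier x -> sign(<w, x> - theta) for non-negative weights w."""
    def classify(x):
        score = sum(wi * xi for wi, xi in zip(w, x)) - theta
        # Ties (score == 0) cannot occur for binary x, 0/1 weights and half-integer theta.
        return 1 if score > 0 else -1
    return classify


def monotone_disjunction(relevant, d):
    """Disjunction of the variables in `relevant`: 0/1 weights, threshold 1/2."""
    w = [1 if j in relevant else 0 for j in range(d)]
    return linear_threshold_classifier(w, 0.5)


def r_of_k_formula(relevant, r, d):
    """True iff at least r of the variables in `relevant` are on: threshold r - 1/2."""
    w = [1 if j in relevant else 0 for j in range(d)]
    return linear_threshold_classifier(w, r - 0.5)


disjunction = monotone_disjunction({0, 2}, d=4)
two_of_three = r_of_k_formula({0, 1, 2}, r=2, d=4)
print(disjunction([0, 1, 0, 1]), disjunction([0, 0, 1, 0]))    # -1 1
print(two_of_three([1, 0, 0, 1]), two_of_three([1, 1, 0, 0]))  # -1 1
```

In both cases the weight vector has 1-norm equal to the number of relevant variables, so only the threshold distinguishes a disjunction from an $r$-of-$k$ formula.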

Studying $k$-disjunctions and $r$-of-$k$ formulas, Littlestone (1988) presented the efficient Winnow
online learning rule, which enjoys an online mistake bound (in the separable case) of $O(k \log d)$
for $k$-disjunctions and $O(rk \log d)$ for $r$-of-$k$ formulas. In fact, in his analysis, Littlestone also
considered the more general case of real-valued weights, corresponding to the class $\mathcal{H}_{k,\theta}$, though
still only over binary instances $x \in \{0,1\}^d$ and only for separable data, and showed that Winnow
enjoys a mistake bound of $O(\theta k \log d)$ in this case as well. By applying a standard online-to-batch
conversion (see e.g. Shalev-Shwartz, 2012), one can also achieve a sample complexity upper bound
of $O(\theta k \log(d)/\epsilon)$ for batch supervised learning of this class in the separable case.

In this paper, we consider the more general case, where the instances $x$ can also be fractional,
i.e. where $x \in [0,1]^d$. More importantly, we also consider the agnostic, non-separable, case. In
order to move on to the fractional and agnostic analysis, we must clarify the loss function we will
use, and the related issue of separation with a margin.

When the instances $x$ and weight vectors $w$ are integer-valued, we have that $\langle w, x\rangle$ is always
an integer. Therefore, if positive and negative instances are at all separated by some predictor
$w$ (i.e. $\operatorname{sign}(\langle w, x\rangle - \theta) = y$ where $y \in \{\pm 1\}$ denotes the target label), they are necessarily
separated by a margin of half. That is, setting $\theta = r - \tfrac{1}{2}$ for an integer $r$, we have $y(\langle w, x\rangle - \theta) \ge \tfrac{1}{2}$.
Moving to fractional instances and weight vectors, we need to require such a margin explicitly. And
if considering the agnostic case, we must account not only for misclassified points, but also for
margin violations. As is standard both in online learning (e.g. the agnostic Perceptron guarantee in
Gentile, 2003) and in statistical learning using convex optimization (e.g. support vector machines),
we will rely on the hinge loss at margin half,² which is equal to $2 \cdot [\tfrac{1}{2} - y h(x)]_+$. The hinge loss
is a convex upper bound to the zero-one loss (that is, the misclassification rate) and so obtaining
learning guarantees for it translates to guarantees on the misclassification error rate.

Phrasing the problem as hinge-loss minimization over the hypothesis class $\mathcal{H}_{k,\theta}$, we can use
Online Exponentiated Gradient (EG) (Kivinen and Warmuth, 1994) or Online Mirror Descent (MD)
(e.g. Shalev-Shwartz, 2007; Srebro et al., 2011), which rely only on the $\ell_1$-bound and hold for any
threshold. In the statistical setting, we can use Empirical Risk Minimization (ERM), in this case
minimizing the empirical hinge loss, and rely on uniform concentration for bounded $\ell_1$ predictors
(Schapire et al., 1997; Zhang, 2002; Kakade et al., 2009), again regardless of the threshold.



1. The value of the mapping when $\langle w, x\rangle = \theta$ can be arbitrary, as our results and our analysis do not depend on it.

2. Measuring the hinge loss at a margin of half rather than a margin of one is an arbitrary choice, which corresponds to
a scaling by a factor of two, and which fits better with the integer case discussed above.

However, these approaches yield mistake bounds or sample complexities that scale quadratically
with the $\ell_1$ norm, that is with $k^2$ rather than with $\theta k$. Since the relevant range of thresholds is
$0 \le \theta \le k$, a scaling of $\theta k$ is always better than $k^2$. When $\theta$ is large, that is, roughly $k/2$, the
Winnow bound agrees with the EG and MD bounds. But when we consider classification with a
small threshold (for instance, $\theta = \tfrac{1}{2}$ in the case of disjunctions), the Winnow analysis clarifies that
this is a much simpler class, with a resulting smaller mistake bound and sample complexity, scaling
with $k$ rather than with $k^2$. This distinction is lost in the EG and MD analyses, and in the ERM
guarantee based on uniform convergence arguments, and for small thresholds, where $\theta = O(1)$, the
difference between these analyses and the Winnow guarantee is a factor of $k$.

Our starting point and our main motivation for this paper is to understand this gap between the
EG, MD and uniform concentration analyses and the Winnow analysis. Is this gap an artifact of
the integer domain or the separability assumption? Or can we obtain guarantees that scale as $\theta k$
rather than $k^2$ also in the non-integer non-separable case? In the statistical setting, must we use an
online algorithm (such as Winnow) and an online-to-batch conversion in order to ensure a sample
complexity that scales with $\theta k$, or can we obtain the same sample complexity also with ERM? Is it
possible to establish uniform convergence guarantees with a dependence on $\theta k$ rather than $k^2$, or do
the learning guarantees here arise from a more delicate argument?

The gap between the Winnow analysis and the more general $\ell_1$-norm-based analyses is
particularly disturbing since we know that, in a sense, online mirror descent always provides the best
possible rates in the online setting (Srebro et al., 2011), and uniform concentration based guarantees
provide the best possible rates for supervised learning in the PAC model (Alon et al., 1993).

Answering the above questions, our main contributions are:

• We provide a variant of online Exponentiated Gradient, for which we establish a regret bound
of $O(\sqrt{\theta k \log(d) T})$ for $\mathcal{H}_{k,\theta}$, improving on the $O(\sqrt{k^2 \log(d) T})$ regret guarantee ensured
by the standard EG analysis. We do so using a more refined analysis based on local norms
(Section 3). Using a standard online-to-batch conversion, this yields a sample complexity of
$O(\theta k \log(d)/\epsilon^2)$ in the statistical setting.

• In the statistical agnostic PAC setting, we show that the rate of uniform convergence of the
empirical hinge loss of predictors in $\mathcal{H}_{k,\theta}$ is indeed $\Omega(\sqrt{k^2/m})$ where $m$ is the sample size,
corresponding to a sample complexity of $\Omega(k^2/\epsilon^2)$, even when $\theta$ is small (Section 5).
Nevertheless, we establish a learning guarantee for empirical risk minimization which matches the
online-to-batch guarantee above (up to logarithmic factors), and ensures a sample complexity
of $O(\theta k \log(d)/\epsilon^2)$ also when using ERM. This is obtained by a more delicate local analysis,
focusing on predictors which might be chosen as empirical risk minimizers, rather than a
uniform analysis over the entire class $\mathcal{H}_{k,\theta}$ (Section 4).

• We also establish a matching lower bound (up to logarithmic factors) of $\Omega(\theta k/\epsilon^2)$ on the
required sample complexity for learning $\mathcal{H}_{k,\theta}$ in the statistical setting. This shows that our
ERM analysis is tight (up to logarithmic factors), and that, furthermore, the regret guarantee
we obtain in the online setting is likewise tight up to logarithmic factors.

1.1 Related Prior Work 



We discussed Littlestone's work on Winnow at length above. In our notation, Littlestone (1988)
established a mistake bound (that is, a regret guarantee in the separable case, where there exists a
predictor with zero hinge loss) of $O(k\theta \log(d))$ when the instances are integer, $x \in \{0,1\}^d$.
Littlestone also established a lower bound of $k \log(d/k)$ on the VC-dimension of $k$-monotone
disjunctions, corresponding to the case $\theta = \tfrac{1}{2}$, thus implying an $\Omega(k \log(d/k)/\epsilon^2)$ lower bound on
learning $\mathcal{H}_{k,1/2}$. However, the question of obtaining a lower bound for other values of the threshold
$\theta$ was left open by Littlestone.

In the agnostic case, Auer and Warmuth (1998) studied the discrete problem of $k$-monotone
disjunctions, corresponding to $\mathcal{H}_{k,1/2}$ with integer instances $x \in \{0,1\}^d$ and integer weights
$w \in \{0,1\}^d$, under the attribute loss, defined as the number of variables in the assignment that need to
be flipped in order to make the predicted label correct. They provide an online algorithm with an
expected mistake bound of $A^* + 2\sqrt{A^* k \ln(d/k)} + O(k \ln(d/k))$, where $A^*$ is the best possible
attribute loss for the given online sequence. An online-to-batch conversion thus achieves here a
zero-one loss which converges to the optimal attribute loss on this problem at the rate of $O(k \ln(d/k)/\epsilon^2)$.
Since the attribute loss is upper bounded by the hinge loss, this result holds also when replacing $A^*$
with the optimal hinge loss for the given sequence. This establishes an agnostic guarantee of the
desired form, for a threshold of $\theta = \tfrac{1}{2}$, and when both the instances and weight vectors are integer.
We are not aware of work on $\mathcal{H}_{k,\theta}$ in the agnostic case for $\theta > \tfrac{1}{2}$ or when the instances $x$ or the
weights $w$ are fractional.

2. Notations and definitions 

For a real number $q$, we denote its positive part by $[q]_+ := \max\{0, q\}$. We denote universal positive
constants by $C$. The value of $C$ may be different between statements or even between lines of the
same expression. We denote by $\mathbb{R}^d_+$ the non-negative orthant in $\mathbb{R}^d$.

We will slightly overload notation and use $w \in \mathcal{H}_{k,\theta}$ to denote both the vector $w \in \mathbb{R}^d_+$ and the
linear predictor $x \mapsto \langle w, x\rangle - \theta$ associated with it, where $\theta$ is implied.

For convenience we will work with half the hinge loss at margin half, and denote this loss, for a
predictor $w \in \mathcal{H}_{k,\theta}$, for $\theta \in [0,k]$, by

$$\ell_\theta(x, y, w) := \left[\tfrac{1}{2} - y(\langle w, x\rangle - \theta)\right]_+ .$$

The subscript $\theta$ will sometimes be omitted when it is clear from context.

Echoing the half-integer thresholds for $k$-monotone disjunctions, $r$-of-$k$ formulas, and the
discrete case more generally, we will denote $r = \theta + \tfrac{1}{2}$, so that $\theta = r - \tfrac{1}{2}$. In the discrete case $r$ is
an integer, but in this paper $\tfrac{1}{2} \le r \le k - \tfrac{1}{2}$ can also be fractional. We will also sometimes refer to
$r' = \tfrac{1}{2} - \theta$. Note that $r'$ can be negative.
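As a sanity check on this notation, the following sketch computes the loss both directly from its definition and through $r$ and $r'$: for a positive label it reads $[r - \langle w, x\rangle]_+$ and for a negative label $[r' + \langle w, x\rangle]_+$. The function names are hypothetical; the zero-one helper counts ties as errors, which is one admissible convention since the tie value is arbitrary.

```python
def winnow_loss(x, y, w, theta):
    """Half the hinge loss at margin half: [1/2 - y(<w, x> - theta)]_+."""
    score = sum(wi * xi for wi, xi in zip(w, x))
    return max(0.0, 0.5 - y * (score - theta))


def winnow_loss_via_r(x, y, w, theta):
    """Same loss written with r = theta + 1/2 and r' = 1/2 - theta."""
    r, r_prime = theta + 0.5, 0.5 - theta
    score = sum(wi * xi for wi, xi in zip(w, x))
    return max(0.0, r - score) if y == 1 else max(0.0, r_prime + score)


def zero_one_loss(x, y, w, theta):
    """1 if sign(<w, x> - theta) disagrees with y (ties counted as errors), else 0."""
    score = sum(wi * xi for wi, xi in zip(w, x)) - theta
    return 0 if y * score > 0 else 1


w_example, theta_example = [1.0, 0.5], 0.5
print(winnow_loss([0.5, 0.5], 1, w_example, theta_example))   # 0.25
print(winnow_loss([0.5, 0.5], -1, w_example, theta_example))  # 0.75
```

Twice this loss is the hinge loss at margin half, so `2 * winnow_loss(...)` upper bounds `zero_one_loss(...)` on every example.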

In the statistical setting, we refer to some fixed and unknown distribution $D$ over instance-label
pairs $(x, y)$, where we assume access to a sample (training set) drawn i.i.d. from $D$, and the objective
is to minimize the expected loss:

$$\ell_\theta(w, D) = \mathbb{E}_{(X,Y)\sim D}[\ell_\theta(X, Y, w)]. \tag{3}$$

When the distribution $D$ is clear from context, we simply write $\ell_\theta(w)$, and we might also omit the
subscript $\theta$. For a set of predictors (hypothesis class) $H$, we denote $\ell_\theta(H, D) := \min_{w \in H} \ell_\theta(w, D)$.
For a sample $S \in ([0,1]^d \times \{\pm 1\})^*$, we use the notation

$$\hat{\mathbb{E}}_S[f(Z)] = \frac{1}{|S|}\sum_{i=1}^{|S|} f(S_i) \tag{4}$$

and again sometimes drop the subscript $S$ when it is clear from context.



2.1 Rademacher complexity 

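The empirical Rademacher complexity of the Winnow loss for a class $W \subseteq \mathbb{R}^d$ with respect to a
sample $S = ((x_1, y_1), \ldots, (x_m, y_m)) \in ([0,1]^d \times \{\pm 1\})^m$ is

$$\mathcal{R}(W, S) := \frac{2}{m}\,\mathbb{E}\left[\sup_{w \in W}\left|\sum_{i=1}^m \epsilon_i\,\ell(x_i, y_i, w)\right|\right], \tag{5}$$

where the expectation is over $\epsilon_1, \ldots, \epsilon_m$, which are independent random variables drawn uniformly
from $\{\pm 1\}$. The average Rademacher complexity of the Winnow loss for a class $W \subseteq \mathbb{R}^d$ with
respect to a distribution $D$ over $[0,1]^d \times \{\pm 1\}$ is denoted by

$$\mathcal{R}_m(W, D) := \mathbb{E}_{S \sim D^m}[\mathcal{R}(W, S)]. \tag{6}$$

We also define the average Rademacher complexity of $W$ with respect to the linear loss by

$$\mathcal{R}^L_m(W, D) := \frac{2}{m}\,\mathbb{E}\left[\sup_{w \in W}\left|\sum_{i=1}^m \epsilon_i Y_i \langle w, X_i\rangle\right|\right], \tag{7}$$

where the expectation is over $\epsilon_1, \ldots, \epsilon_m$ as above and $((X_1, Y_1), \ldots, (X_m, Y_m)) \sim D^m$.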
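For intuition, the linear-loss complexity can be estimated numerically on a fixed sample. The sketch below is a Monte Carlo estimate under two assumptions that should be checked against the definitions above: the normalization is $2/m$ with an absolute value inside the supremum, and the class is $W_k = \{w \ge 0 : \|w\|_1 \le k\}$, for which the supremum of $|\langle w, v\rangle|$ has the closed form $k \max_j |v_j|$. Function names are hypothetical.

```python
import random


def sup_linear(v, k):
    """sup of |<w, v>| over w >= 0 with ||w||_1 <= k, attained at a scaled basis vector."""
    return k * max(abs(vj) for vj in v)


def empirical_rademacher_linear(xs, ys, k, n_draws=2000, seed=0):
    """Monte Carlo estimate of (2/m) E[ sup_w | sum_i eps_i y_i <w, x_i> | ] on a fixed sample."""
    rng = random.Random(seed)
    m, d = len(xs), len(xs[0])
    total = 0.0
    for _ in range(n_draws):
        eps = [rng.choice((-1, 1)) for _ in range(m)]
        # v = sum_i eps_i * y_i * x_i, computed coordinate by coordinate
        v = [sum(eps[i] * ys[i] * xs[i][j] for i in range(m)) for j in range(d)]
        total += sup_linear(v, k)
    return 2.0 / m * total / n_draws
```

Because the supremum is linear in $k$, the estimate scales exactly linearly with the norm bound, which is the source of the $k^2$ dependence in sample complexities derived from uniform convergence over all of $W_k$.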

2.2 Probability tools

We use the following form of Bernstein's inequality: For a random variable $X \in [0,1]$, with
probability at least $1 - \delta$ over $n$ i.i.d. draws of $X$,

$$\mathbb{E}[X] - \hat{\mathbb{E}}[X] \le 2\sqrt{\frac{\ln(1/\delta)}{n}\cdot\max\left(\mathbb{E}[X], \frac{\ln(1/\delta)}{n}\right)}. \tag{8}$$

The same holds for $\hat{\mathbb{E}}[X] - \mathbb{E}[X]$.

We further use the following lemma, which bounds the ratio between the empirical fraction of
positive or negative labels and their true probabilities. We will apply this lemma to make sure that
enough negative and positive labels can be found in a random sample.



Lemma 1 Let $B$ be a binomial random variable, $B \sim \mathrm{Binomial}(m, p)$. If

$$p \ge \frac{16\ln(1/\delta)}{m} \tag{9}$$

then with probability of at least $1 - \delta$, $B \ge mp/2$.
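The lemma can be checked numerically for specific parameters by computing the binomial lower tail exactly. The sketch below is only an illustration for chosen values of $m$ and $\delta$, not a proof; it sets $p$ to the smallest value allowed by the condition $p \ge 16\ln(1/\delta)/m$, and the function names are hypothetical.

```python
import math


def prob_binomial_below(m, p, threshold):
    """Exact P[B < threshold] for B ~ Binomial(m, p)."""
    upper = min(m + 1, math.ceil(threshold))  # integers b with b < threshold
    return sum(math.comb(m, b) * p ** b * (1 - p) ** (m - b) for b in range(upper))


def lemma1_failure_probability(m, delta):
    """Return (p, P[B < mp/2]) with p set to the boundary value 16 ln(1/delta) / m."""
    p = 16 * math.log(1 / delta) / m
    if p > 1:
        raise ValueError("m is too small for this delta: the condition forces p > 1")
    return p, prob_binomial_below(m, p, m * p / 2)


p_boundary, failure = lemma1_failure_probability(m=500, delta=0.05)
print(p_boundary, failure)
```

For these values the exact failure probability is far below $\delta$, which reflects the slack in the constant 16.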

Proof Denote $\hat{p} = B/m$. From Bernstein's inequality (Eq. (8)), with probability of at least $1 - \delta$:

$$\hat{p} \ge p - 2\sqrt{\frac{\ln(1/\delta)}{m}\cdot\max\left(p, \frac{\ln(1/\delta)}{m}\right)}.$$

Under Eq. (9), we have that $\max(p, \ln(1/\delta)/m) = p$ and that $\ln(1/\delta)/m \le p/16$, which yields
$\hat{p} \ge p - 2\sqrt{p^2/16} = p/2$, that is, $B \ge mp/2$. ■



3. Online Algorithm 

Consider the following algorithm: 



Unnormalized Exponentiated Gradient
(unnormalized-EG)

parameters: $\eta, \lambda > 0$
input: $z_1, \ldots, z_T \in \mathbb{R}^d$
initialize: $w_1 = (\lambda, \ldots, \lambda) \in \mathbb{R}^d$

update rule: $\forall i,\ w_{t+1}[i] = w_t[i]\, e^{-\eta z_t[i]}$
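A minimal sketch of the update rule follows. In the applications considered later the vector $z_t$ is a sub-gradient that depends on the current iterate, so the sketch takes a callback instead of a fixed input sequence; this callback interface, the function names, and the toy data are assumptions made for illustration.

```python
import math


def unnormalized_eg(subgradient, d, T, eta, lam):
    """Run unnormalized EG for T rounds and return the iterates w_1, ..., w_T.

    `subgradient(t, w)` is a hypothetical callback returning the vector z_t
    (a list of d floats) once the current iterate w_t is known.
    """
    w = [lam] * d                      # w_1 = (lambda, ..., lambda)
    iterates = []
    for t in range(T):
        iterates.append(list(w))
        z = subgradient(t, w)
        # Multiplicative update; note that there is no renormalization step.
        w = [wi * math.exp(-eta * zi) for wi, zi in zip(w, z)]
    return iterates


def winnow_subgradient(x, y, w, theta):
    """A sub-gradient in w of the loss [1/2 - y(<w, x> - theta)]_+."""
    score = sum(wi * xi for wi, xi in zip(w, x))
    if 0.5 - y * (score - theta) > 0:
        return [-y * xi for xi in x]
    return [0.0] * len(x)


# Toy run: one positive example repeated, threshold 1/2, d = 3.
example_x, example_y, example_theta = [1.0, 0.0, 0.0], 1, 0.5
ws = unnormalized_eg(
    lambda t, w: winnow_subgradient(example_x, example_y, w, example_theta),
    d=3, T=10, eta=0.5, lam=1.0 / 3)
```

In the toy run the weight of the only active coordinate grows by a factor $e^{\eta}$ per round until the margin condition is met, after which the sub-gradient is zero and the iterate stops changing; the other coordinates never move.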



The following theorem provides a regret bound with local norms for the unnormalized EG
algorithm. For a proof see Shalev-Shwartz (2012), Theorem 2.23.

Theorem 2 Assume that the unnormalized EG algorithm is run on a sequence of vectors such that
for all $t, i$ we have $\eta z_t[i] \ge -1$. Then, for all $u \ge 0$,

$$\sum_{t=1}^T \langle w_t - u, z_t\rangle \le \frac{d\lambda + \sum_{i=1}^d u[i]\ln\big(u[i]/(e\lambda)\big)}{\eta} + \eta\sum_{t=1}^T\sum_{i=1}^d w_t[i]\, z_t[i]^2 .$$



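Now, let us apply it to a case in which we have a sequence of convex functions $f_1, \ldots, f_T$, and
$z_t$ is a sub-gradient of $f_t$ at $w_t$. Additionally, set $\lambda = 1/d$ and consider $u$ s.t. $\|u\|_1 \le k$. We
obtain

Theorem 3 Assume that the unnormalized EG algorithm is run with $\lambda = 1/d$. Assume that for
all $t$, we have $z_t \in \partial f_t(w_t)$, for some convex function $f_t$. Further assume that for all $t, i$ we have
$\eta z_t[i] \ge -1$, and that for some positive constants $\alpha, \beta$ we have that

$$\sum_{i=1}^d w_t[i]\, z_t[i]^2 \le \alpha f_t(w_t) + \beta . \tag{10}$$

Then, for all $u \ge 0$ with $\|u\|_1 \le k$ we have

$$\sum_{t=1}^T f_t(w_t) \le \frac{1}{1 - \alpha\eta}\left(\sum_{t=1}^T f_t(u) + \eta\beta T + \frac{2k\ln(kd)}{\eta}\right).$$

Proof Using the convexity of $f_t$ and the assumption that $z_t \in \partial f_t(w_t)$ we have that

$$\sum_{t=1}^T \big(f_t(w_t) - f_t(u)\big) \le \sum_{t=1}^T \langle w_t - u, z_t\rangle .$$

Combining with Theorem 2 we obtain

$$\sum_{t=1}^T \big(f_t(w_t) - f_t(u)\big) \le \frac{d\lambda + \sum_{i=1}^d u[i]\ln\big(u[i]/(e\lambda)\big)}{\eta} + \eta\sum_{t=1}^T\sum_{i=1}^d w_t[i]\, z_t[i]^2 .$$

Using the assumption in Eq. (10), the definition of $\lambda = 1/d$ and the assumptions on $u$, we obtain

$$\sum_{t=1}^T \big(f_t(w_t) - f_t(u)\big) \le \frac{2k\ln(kd)}{\eta} + \eta\beta T + \eta\alpha\sum_{t=1}^T f_t(w_t) .$$

Rearranging the above we conclude our proof. ■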
We can now show the desired regret bound for our algorithm. 

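Corollary 4 Fix any sequence $(x_1, y_1), (x_2, y_2), \ldots, (x_T, y_T) \in [0,1]^d \times \{\pm 1\}$ and assume
$T \ge 8k\ln(kd)/r$. Suppose the unnormalized EG algorithm listed in Section 3 is run using $\eta := \sqrt{\frac{2k\ln(kd)}{rT}}$,
$\lambda := 1/d$, and any $z_t \in \partial_w \ell(x_t, y_t, w_t)$ for all $t$. Fix any $u \in \mathbb{R}^d_+$ with $\|u\|_1 \le k$, and define
$L_{uEG} := \frac{1}{T}\sum_{t=1}^T \ell(x_t, y_t, w_t)$ and $L(u) := \frac{1}{T}\sum_{t=1}^T \ell(x_t, y_t, u)$. Then

$$L_{uEG} \le L(u) + L(u)\cdot\sqrt{\frac{8k\ln(kd)}{rT}} + \sqrt{\frac{8rk\ln(kd)}{T}} + \frac{8k\ln(kd)}{T} .$$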

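Proof Every sub-gradient $z_t \in \partial_w \ell(x_t, y_t, w_t)$ is of the form $z_t = a_t x_t$ for some $a_t \in \{-1, 0, +1\}$.
Since $0 \le x_t[i] \le 1$ and $w_t[i] \ge 0$ for all $i$, it follows that $\sum_{i=1}^d w_t[i]\, z_t[i]^2 = |a_t| \sum_{i=1}^d w_t[i]\, x_t[i]^2 \le
|a_t| \langle w_t, x_t\rangle$. Now consider three disjoint cases.

• Case 1: $\langle w_t, x_t\rangle \le r$. Then $\sum_{i=1}^d w_t[i]\, z_t[i]^2 \le \langle w_t, x_t\rangle \le r$.

• Case 2: $\langle w_t, x_t\rangle > r$ and $y_t = 1$. Then $a_t = 0$ and $\sum_{i=1}^d w_t[i]\, z_t[i]^2 = 0$.

• Case 3: $\langle w_t, x_t\rangle > r$ and $y_t = -1$. Then $\sum_{i=1}^d w_t[i]\, z_t[i]^2 \le \langle w_t, x_t\rangle \le [r' + \langle w_t, x_t\rangle]_+ - r' \le
[r' + \langle w_t, x_t\rangle]_+ + r$.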
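The case analysis above can be spot-checked numerically: for random non-negative weights and instances in $[0,1]^d$, the local norm $\sum_i w[i] z[i]^2$ of a sub-gradient should never exceed the loss plus $r$. The sketch below uses $r' = 1 - r$ (which follows from $r = \theta + \tfrac12$ and $r' = \tfrac12 - \theta$); the function names and sampling ranges are arbitrary choices for illustration.

```python
import random


def local_norm_and_loss(x, y, w, r):
    """Return (sum_i w[i] * z[i]**2, loss) for a sub-gradient z of the loss at w."""
    r_prime = 1.0 - r                  # r' = 1/2 - theta with theta = r - 1/2
    score = sum(wi * xi for wi, xi in zip(w, x))
    loss = max(0.0, r - score) if y == 1 else max(0.0, r_prime + score)
    a = -y if loss > 0 else 0          # z = a * x with a in {-1, 0, +1}
    local_norm = sum(wi * (a * xi) ** 2 for wi, xi in zip(w, x))
    return local_norm, loss


def max_violation(trials=2000, d=5, seed=0):
    """Largest observed value of local_norm - (loss + r) over random inputs."""
    rng = random.Random(seed)
    worst = float("-inf")
    for _ in range(trials):
        r = rng.uniform(0.5, 4.0)
        w = [rng.uniform(0.0, 2.0) for _ in range(d)]
        x = [rng.random() for _ in range(d)]
        y = rng.choice((-1, 1))
        local_norm, loss = local_norm_and_loss(x, y, w, r)
        worst = max(worst, local_norm - (loss + r))
    return worst
```

A non-positive return value from `max_violation()` is consistent with the three cases; it is evidence on sampled inputs, not a substitute for the argument.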

In all three cases, the final upper bound on $\sum_{i=1}^d w_t[i]\, z_t[i]^2$ is at most $\ell(x_t, y_t, w_t) + r$. Therefore,
Eq. (10) from Theorem 3 is satisfied with $f_t(w) := \ell(x_t, y_t, w)$, $\alpha := 1$, and $\beta := r$. The claim
now follows from Theorem 3 with this choice of $f_t$ and the given settings of $\eta$, $\lambda$, and $z_t$ (using the
inequality $1/(1 - x) \le 1 + 2x$ for $x \in [0, 1/2]$). ■
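The arithmetic behind this last step can be spelled out as follows. This is a sketch under two assumptions: that Theorem 3 is applied in the form $\sum_t f_t(w_t) \le \frac{1}{1-\alpha\eta}\big(\sum_t f_t(u) + \eta\beta T + 2k\ln(kd)/\eta\big)$ with $\alpha = 1$ and $\beta = r$, and that the step size is $\eta = \sqrt{2k\ln(kd)/(rT)}$. The constants should be checked against the statements of Theorem 3 and Corollary 4.

```latex
\begin{align*}
L_{uEG}
  &\le \frac{1}{1-\eta}\left( L(u) + \eta r + \frac{2k\ln(kd)}{\eta T} \right)
     && \text{(Theorem 3 divided by } T\text{)} \\
  &= \frac{1}{1-\eta}\left( L(u) + \sqrt{\frac{8rk\ln(kd)}{T}} \right)
     && \text{(both terms equal } \sqrt{2rk\ln(kd)/T}\text{)} \\
  &\le (1+2\eta)\left( L(u) + \sqrt{\frac{8rk\ln(kd)}{T}} \right)
     && (\eta \le 1/2 \text{ since } T \ge 8k\ln(kd)/r) \\
  &= L(u) + L(u)\sqrt{\frac{8k\ln(kd)}{rT}} + \sqrt{\frac{8rk\ln(kd)}{T}} + \frac{8k\ln(kd)}{T}
     && (2\eta = \sqrt{8k\ln(kd)/(rT)}).
\end{align*}
```

The condition $\eta z_t[i] \ge -1$ required by Theorem 3 holds as well, because $z_t \in [-1,1]^d$ and $\eta \le 1/2$.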
4. ERM Upper bound

We now proceed to the batch setting. We wish to show an upper bound on $\ell(\hat{w}) - \ell(w^*)$, where
$w^* \in \operatorname{argmin}_{w \in W_k} \mathbb{E}[\ell(X, Y, w)]$, and $\hat{w} \in \operatorname{argmin}_{w \in W_k} \hat{\mathbb{E}}[\ell(X, Y, w)]$ is an ERM. We
will prove the following theorem:

Theorem 5 For $k \ge r \ge 0$, with probability $1 - \delta$,

$$\ell(\hat{w}) \le \ell(w^*) + \sqrt{\frac{O\big(rk(\ln(kd)\ln^3(m) + \ln(1/\delta))\big)}{m}} . \tag{11}$$

Our proof strategy will be to consider the loss on negative examples and the loss on positive
examples separately. Denote

$$\ell_-(w, D) = \mathbb{E}_{(X,Y)\sim D}[\ell(X, Y, w) \mid Y = -1], \quad\text{and}\quad \ell_+(w, D) = \mathbb{E}_{(X,Y)\sim D}[\ell(X, Y, w) \mid Y = +1].$$

For a given sample $((X_1, Y_1), \ldots, (X_m, Y_m))$, denote $\hat{\ell}_-(w) = \hat{\mathbb{E}}[\ell(X, Y, w) \mid Y = -1]$ and
similarly for $\hat{\ell}_+(w)$. As we show in Section 5.2, uniform convergence for negative examples is too
slow if we consider any $w \in W_k$. However, we will show that the rate is fast enough for any $w$
that might be returned by an algorithm that minimizes the loss on a sample drawn from $D$. For
positive labels, we will show that with high probability over the draw of an i.i.d. sample from $D$,
the true loss of any $w \in W_k$ on examples with positive labels is close to the empirical loss of that $w$
on positive examples. We will then combine the two results while taking into account the balance
between positive and negative labels in $D$.

4.1 Convergence on Negative labels 

We now commence our proof for the convergence rate of ERM for the Winnow loss. As shown in
Theorem 21, the empirical Winnow loss for negative examples does not converge fast enough to the
true loss on negative examples for all $w \in W_k$. Luckily, not all $w \in W_k$ might be returned by an
algorithm that minimizes the Winnow loss. We now show that with high probability the output of the
ERM algorithm belongs to a more restricted class than $W_k$. Fix a sample $((x_1, y_1), \ldots, (x_m, y_m))$,
and let

$$\hat{w} \in \operatorname{argmin}_{w \in W_k} \frac{1}{m}\sum_{i=1}^m \ell(x_i, y_i, w).$$

We first show a sample-dependent restriction on $\hat{w}$.

For a given distribution $D$, denote $p_+ = \mathbb{P}_{(X,Y)\sim D}[Y = +1]$ and $\hat{p}_+ = \hat{\mathbb{P}}[Y = +1]$, and
similarly for $p_-$ and $\hat{p}_-$.

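Lemma 6

$$\hat{\mathbb{E}}[\langle \hat{w}, X\rangle \mid Y = -1] \le \frac{r}{\hat{p}_-} .$$

Proof Let $m_+ = |\{i \mid y_i = +1\}|$, and $m_- = |\{i \mid y_i = -1\}|$. By the definition of the hinge
function and the fact that $\langle x_i, \hat{w}\rangle \ge 0$ for all $i$ we have that

$$m_- r' + \sum_{y_i = -1} \langle x_i, \hat{w}\rangle \le \sum_{y_i = -1} [r' + \langle x_i, \hat{w}\rangle]_+ \le \sum_{y_i = +1} [r - \langle x_i, \hat{w}\rangle]_+ + \sum_{y_i = -1} [r' + \langle x_i, \hat{w}\rangle]_+ = \sum_{i \in [m]} \ell(x_i, y_i, \hat{w}).$$

By the optimality of $\hat{w}$,

$$\sum_{i \in [m]} \ell(x_i, y_i, \hat{w}) \le \sum_{i \in [m]} \ell(x_i, y_i, 0) = m_+ r + m_- [r']_+ .$$

Therefore

$$\sum_{y_i = -1} \langle x_i, \hat{w}\rangle \le m_+ r + m_-([r']_+ - r') = m_+ r + m_- [-r']_+ \le (m_- + m_+) r = mr .$$

Dividing both sides by $m_-$ we conclude our proof. ■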

The next lemma will allow us to conclude from Lemma 6 that $\hat{w}$ is in a restricted class with high
probability over the samples.

Lemma 7 For any distribution over $[0,1]^d$, with probability $1 - \delta$ over samples of size $n$, for any
$w \in W_k$

$$\mathbb{E}[\langle w, X\rangle] \le 2\hat{\mathbb{E}}[\langle w, X\rangle] + \frac{16k\ln(d/\delta)}{n} .$$



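Proof For every $j \in [d]$, denote $\alpha_j = \mathbb{E}[X[j]]$. Denote $\hat{\alpha}_j = \hat{\mathbb{E}}[X[j]]$. By Bernstein's inequality
(Eq. (8)), with probability $1 - \delta$,

$$\alpha_j \le \hat{\alpha}_j + 2\sqrt{\frac{\ln(1/\delta)}{n}\cdot\max\left(\alpha_j, \frac{\ln(1/\delta)}{n}\right)} \le \hat{\alpha}_j + \max\left(\frac{\alpha_j}{2}, \frac{8\ln(1/\delta)}{n}\right),$$

where the last inequality can be verified by considering the cases $\alpha_j \le 16\ln(1/\delta)/n$ and $\alpha_j \ge 16\ln(1/\delta)/n$.
Applying the union bound over $j \in [d]$ we obtain that with probability of $1 - \delta$ over samples of size
$n$, for any $w \in W_k$

$$\mathbb{E}[\langle w, X\rangle] = \langle w, \alpha\rangle \le \sum_{j \in [d]} w[j]\left(\hat{\alpha}_j + \frac{\alpha_j}{2} + \frac{8\ln(d/\delta)}{n}\right) \le \hat{\mathbb{E}}[\langle w, X\rangle] + \frac{1}{2}\mathbb{E}[\langle w, X\rangle] + \frac{8k\ln(d/\delta)}{n} .$$

Thus $\mathbb{E}[\langle w, X\rangle] \le 2\hat{\mathbb{E}}[\langle w, X\rangle] + 16k\ln(d/\delta)/n$. ■

We can now conclude a restriction on $\hat{w}$ with high probability.

Theorem 8 If $p_- \ge \frac{16\ln(1/\delta)}{m}$ then with probability $1 - 2\delta$ over samples of size $m$,

$$\mathbb{E}[\langle \hat{w}, X\rangle \mid Y = -1] \le \frac{4r}{p_-} + \frac{32k\ln(d/\delta)}{m p_-} . \tag{12}$$

Proof Lemma 7 implies that with probability of $1 - \delta$ over samples drawn from $D$ that have $n$
negative examples,

$$\mathbb{E}[\langle \hat{w}, X\rangle \mid Y = -1] \le 2\hat{\mathbb{E}}[\langle \hat{w}, X\rangle \mid Y = -1] + \frac{16k\ln(d/\delta)}{n} .$$

Therefore, by Lemma 6, and since $n = m\hat{p}_-$,

$$\mathbb{E}[\langle \hat{w}, X\rangle \mid Y = -1] \le 2\hat{\mathbb{E}}[\langle \hat{w}, X\rangle \mid Y = -1] + \frac{16k\ln(d/\delta)}{m\hat{p}_-} \le \frac{2r}{\hat{p}_-} + \frac{16k\ln(d/\delta)}{m\hat{p}_-} \le \frac{4r}{p_-} + \frac{32k\ln(d/\delta)}{m p_-}, \tag{13}$$

where the last inequality follows from the assumption and Lemma 1. ■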

This theorem shows that to bound the sample complexity of an ERM algorithm, it suffices to
show convergence rates of the empirical loss for $w$ that satisfy Eq. (12). For any $b \ge 0$ and a fixed
distribution $D$, define

$$U_b = \{ w \in \mathbb{R}^d_+ \mid \|w\|_1 \le k,\ \mathbb{E}_D[\langle w, X\rangle] \le b \} .$$

Note that $U_b \subseteq W_k$, and that $b$ can be set according to Eq. (12) so that with high probability $\hat{w} \in U_b$.
We bound the rate of convergence of the empirical loss on negative examples to the true loss on
negative examples for all $w \in U_b$. This is accomplished in two stages: first we bound $\mathcal{R}^L_m(U_b, D)$
for any distribution $D$ over $[0,1]^d \times \{\pm 1\}$, and then we conclude a similar bound on $\mathcal{R}_m(U_b, D)$
for any $D$ that draws only negative labels.

We first prove a more general lemma that we will use to derive the desired bound.

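Lemma 9 For a fixed distribution $D$ over $[0,1]^d \times \{\pm 1\}$, let $\alpha_j = \mathbb{E}_{(X,Y)\sim D}[X[j]]$, and let
$\mu \in \mathbb{R}^d_+$ be a non-negative vector. Define

$$U^\mu = \{ w \in \mathbb{R}^d_+ \mid \langle w, \mu\rangle \le 1 \} .$$

Then if $dm \ge 3$,

$$\mathcal{R}^L_m(U^\mu, D) \le \max_{j : \alpha_j > 0} \frac{1}{\mu_j}\sqrt{\frac{32\ln(d)}{m}\cdot\max\left(\alpha_j, \frac{\ln(dm)}{m}\right)} .$$

Proof Assume w.l.o.g. that $\alpha_j > 0$ for all $j$ (if this is not the case, dimensions with $\alpha_j = 0$ can be
removed because this implies that $X[j] = 0$ with probability 1). Fix a sample $S$ and let $\sigma_i := \epsilon_i y_i$,
which are again independent and uniform over $\{\pm 1\}$. Then

$$\frac{m}{2}\,\mathcal{R}^L(U^\mu, S) = \mathbb{E}_\sigma\left[\sup_{w : \langle w, \mu\rangle \le 1}\left|\sum_{i=1}^m \sigma_i \langle w, x_i\rangle\right|\right] = \mathbb{E}_\sigma\left[\sup_{w : \langle w, \mu\rangle \le 1}\left|\left\langle w, \sum_{i=1}^m \sigma_i x_i\right\rangle\right|\right] = \mathbb{E}_\sigma\left[\max_{j \in [d]} \frac{1}{\mu[j]}\left|\sum_{i=1}^m \sigma_i x_i[j]\right|\right] .$$

Therefore, using Massart's lemma and denoting $\hat{\alpha}_j = \frac{1}{m}\sum_{i \in [m]} x_i[j]$ (so that $\sum_i x_i[j]^2 \le m\hat{\alpha}_j$), we have:

$$\mathcal{R}^L(U^\mu, S) \le \frac{2}{m}\sqrt{2\ln(d)}\cdot\max_j \frac{\sqrt{m\hat{\alpha}_j}}{\mu[j]} = \sqrt{\frac{8\ln(d)}{m}}\cdot\max_j \frac{\sqrt{\hat{\alpha}_j}}{\mu[j]} .$$

Taking expectation over $S$ and using Jensen's inequality we obtain

$$\mathcal{R}^L_m(U^\mu, D) = \mathbb{E}_S[\mathcal{R}^L(U^\mu, S)] \le \sqrt{\frac{8\ln(d)}{m}}\cdot\sqrt{\mathbb{E}_S\left[\max_j \frac{\hat{\alpha}_j}{\mu^2[j]}\right]} .$$

By Bernstein's inequality, with probability $1 - \delta$ over the choice of $\{x_i\}$, for all $j \in [d]$

$$\hat{\alpha}_j \le \alpha_j + 2\sqrt{\frac{\ln(d/\delta)}{m}\cdot\max\left(\alpha_j, \frac{\ln(d/\delta)}{m}\right)} .$$

And, in any case, $\hat{\alpha}_j \le 1$. Therefore,

$$\mathbb{E}_S\left[\max_j \frac{\hat{\alpha}_j}{\mu^2[j]}\right] \le \max_j \frac{1}{\mu^2[j]}\left(\delta + \alpha_j + 2\sqrt{\frac{\ln(d/\delta)}{m}\cdot\max\left(\alpha_j, \frac{\ln(d/\delta)}{m}\right)}\right) .$$

Choose $\delta = 1/m$ and let $j$ be a maximizer of the above. Consider two cases. If $\alpha_j \le \ln(dm)/m$
then

$$\mathbb{E}_S\left[\max_j \frac{\hat{\alpha}_j}{\mu^2[j]}\right] \le \max_j \frac{1}{\mu^2[j]}\cdot\frac{4\ln(dm)}{m} .$$

Otherwise,

$$\mathbb{E}_S\left[\max_j \frac{\hat{\alpha}_j}{\mu^2[j]}\right] \le \max_j \frac{1}{\mu^2[j]}(\delta + 3\alpha_j) \le \max_j \frac{4\alpha_j}{\mu^2[j]} .$$

All in all, we have shown

$$\mathcal{R}^L_m(U^\mu, D) \le \max_j \frac{1}{\mu_j}\sqrt{\frac{32\ln(d)}{m}\cdot\max\left(\alpha_j, \frac{\ln(dm)}{m}\right)} .$$ ■

We now prove the desired Rademacher complexity bound on $U_b$.

Theorem 10 For any distribution $D$ over $(X, Y) \in [0,1]^d \times \{\pm 1\}$, if $dm \ge 3$,

$$\mathcal{R}^L_m(U_b, D) \le \sqrt{\frac{128k\ln(d)}{m}\cdot\max\left(b, \frac{k\ln(dm)}{m}\right)} .$$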



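Proof Define $\alpha_j$ and $U^\mu$ as in Lemma 9. Let $J = \{ j \in [d] \mid \alpha_j \ge b/k \}$ and $\bar{J} = \{ j \in [d] \mid \alpha_j < b/k \}$.
For a vector $v \in \mathbb{R}^d$ and a set $I \subseteq [d]$, denote by $v[I]$ the vector which is obtained from $v$ by setting
the coordinates not in $I$ to zero. We have

$$\begin{aligned}
\mathcal{R}^L_m(U_b, D) &= \frac{2}{m}\,\mathbb{E}\left[\sup_{w \in U_b}\left|\sum_{i=1}^m \epsilon_i Y_i \langle w, X_i\rangle\right|\right] \\
&= \frac{2}{m}\,\mathbb{E}\left[\sup_{w \in U_b}\left|\sum_{i=1}^m \epsilon_i Y_i \langle w[J], X_i[J]\rangle + \sum_{i=1}^m \epsilon_i Y_i \langle w[\bar{J}], X_i[\bar{J}]\rangle\right|\right] \\
&\le \frac{2}{m}\,\mathbb{E}\left[\sup_{w \in U_b}\left|\sum_{i=1}^m \epsilon_i Y_i \langle w[J], X_i[J]\rangle\right|\right] + \frac{2}{m}\,\mathbb{E}\left[\sup_{w \in U_b}\left|\sum_{i=1}^m \epsilon_i Y_i \langle w[\bar{J}], X_i[\bar{J}]\rangle\right|\right] \\
&= \mathcal{R}^L_m(U_b, D_1) + \mathcal{R}^L_m(U_b, D_2),
\end{aligned} \tag{17}$$

where $D_1$ is the distribution of $(X[J], Y)$ and $D_2$ is the distribution of $(X[\bar{J}], Y)$. We now bound
the two Rademacher complexities of the right-hand side using Lemma 9.

To bound $\mathcal{R}^L_m(U_b, D_1)$, define $\mu_1 \in \mathbb{R}^d_+$ by $\mu_1[j] = \alpha_j/b$. It is easy to see that $U_b \subseteq U^{\mu_1}$.
Therefore $\mathcal{R}^L_m(U_b, D_1) \le \mathcal{R}^L_m(U^{\mu_1}, D_1)$. By Lemma 9 and the definition of $\mu_1$

$$\begin{aligned}
\mathcal{R}^L_m(U^{\mu_1}, D_1) &\le \max_{j \in J} \frac{1}{\mu_1[j]}\sqrt{\frac{32\ln(d)}{m}\cdot\max\left(\alpha_j, \frac{\ln(dm)}{m}\right)} \\
&= \max_{j \in J} \frac{b}{\alpha_j}\sqrt{\frac{32\ln(d)}{m}\cdot\max\left(\alpha_j, \frac{\ln(dm)}{m}\right)} \\
&= \max_{j \in J} \sqrt{\frac{b}{\alpha_j}\cdot\frac{32\ln(d)}{m}\cdot\max\left(b, \frac{b}{\alpha_j}\cdot\frac{\ln(dm)}{m}\right)} .
\end{aligned}$$

By the definition of $J$, for all $j \in J$ we have $\frac{b}{\alpha_j} \le k$. It follows that

$$\mathcal{R}^L_m(U^{\mu_1}, D_1) \le \sqrt{\frac{32k\ln(d)}{m}\cdot\max\left(b, \frac{k\ln(dm)}{m}\right)} . \tag{18}$$

To bound $\mathcal{R}^L_m(U_b, D_2)$, define $\mu_2 \in \mathbb{R}^d_+$ by $\mu_2[j] = \frac{1}{k}$. Note that $U^{\mu_2} = W_k$ and $U_b \subseteq W_k$,
hence $\mathcal{R}^L_m(U_b, D_2) \le \mathcal{R}^L_m(U^{\mu_2}, D_2)$. By Lemma 9 and the definition of $\mu_2$

$$\mathcal{R}^L_m(U^{\mu_2}, D_2) \le \max_{j \in \bar{J}} k\sqrt{\frac{32\ln(d)}{m}\cdot\max\left(\alpha_j, \frac{\ln(dm)}{m}\right)} = \max_{j \in \bar{J}} \sqrt{\frac{32k\ln(d)}{m}\cdot\max\left(k\alpha_j, \frac{k\ln(dm)}{m}\right)} .$$

By the definition of $\bar{J}$, for all $j \in \bar{J}$ we have $k\alpha_j \le b$. Therefore

$$\mathcal{R}^L_m(U^{\mu_2}, D_2) \le \sqrt{\frac{32k\ln(d)}{m}\cdot\max\left(b, \frac{k\ln(dm)}{m}\right)} . \tag{19}$$

Combining Eq. (17), Eq. (18) and Eq. (19) we get the statement of the theorem. ■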

We can now derive our convergence result for negative examples. 

Corollary 11 Let $b \ge 0$. There exists a universal constant $C$ such that for any distribution $D$ over
$[0,1]^d \times \{\pm 1\}$ that draws only negative labels, with probability $1 - \delta$ over samples of size $m$, for
any $w \in U_b$,

< i-M + C ( [¥±FS^^^ + kHedm/6)\ ^^^^ 
\ V m ml 



Proof Define $\phi : \mathbb{R} \to \mathbb{R}$ by $\phi(z) = [r' - z]_+$. Since $D$ draws only negative labels, the Winnow
loss on pairs $(X, Y)$ drawn from $D$ is exactly $\phi(Y\langle w, X\rangle)$. Note that $\phi$ is an application of a
1-Lipschitz function to a translation by $r'$ of the linear loss. Thus, by the properties of the Rademacher
complexity and by Theorem 10 we have, for $dm \ge 3$,

n^iUb, D) < n'^iUb, I?) + a/ — 

V m 



I28kln(d) / A;ln(dm) , 

<W ^max 6, ^ -]+\—- (21) 

\ m \ m 




Assume that $r' \le 0$. By Talagrand's inequality (see e.g. Boucheron et al., 2005, Theorem 5.4), with
probability $1 - \delta$ over samples of size $m$ drawn from $D$, for all $w \in U_b$



£-{w) < £-{w) + 2nm{Ub,D) + \ \ . (22) 

V m 6m 

To bound the variance of $\ell(X, Y, w)$, we note that $\ell(X, Y, w) \in [0, k]$. In addition, $Y = -1$, thus
$\ell(X, Y, w) = [r' + \langle w, X\rangle]_+$. Since $r' \le 0$, for any $w \in U_b$

$$\operatorname{Var}[\ell(X, Y, w)] \le k \cdot \mathbb{E}[\ell(X, Y, w)] \le k \cdot \mathbb{E}[\langle w, X\rangle] \le kb . \tag{23}$$

Combining Eq. (21), Eq. (22) and Eq. (23) we conclude that there exists a universal constant $C$ such
that for any $w \in U_b$,

e.H < Liw) + C f /M±SM^ + kHedm/5)\ 
\ V m ml 

Now, for any $r' > 0$, the values of $\ell_-(w)$ and $\hat{\ell}_-(w)$ are the same as the values for $r' = 0$ except
for an identical additive term of $r'$, thus the same result holds. ■



4.2 Convergence on Positive Labels 

For positive labels, we show a uniform convergence result. The idea of the proof technique below is
as follows. First, following a technique in the spirit of the one given in Zhang (2002), we show that
the regret bound for the online learning algorithm presented in Section 3 can be used to construct a
small cover of the set of loss functions parameterized by $W_k$. Second, we convert the bound on the
size of the cover to a bound on the Rademacher complexity, thus showing a uniform convergence
result. This argument is a refinement of Dudley's entropy bound (Dudley, 1967), which is stated the
most explicitly in Srebro et al. (2010) (Lemma A.3).

We start with the following direct corollary of Theorem 3.

Corollary 12 Assume that the conditions of Theorem 3 hold. Assume also that there is $u$ such that
$f_t(u) = 0$ for all $t$. Set $\eta = \sqrt{\frac{2k\ln(kd)}{\beta T}}$ and assume that $T$ is large enough so that $\alpha\eta \le 1/2$. Then,

$$\sum_{t=1}^T f_t(w_t) \le 4\sqrt{2\beta k\ln(kd)\, T} .$$

Let $k \ge r \ge 0$ be two real numbers and let $W \subseteq \mathbb{R}^d_+$. Let $f_w$ denote the function defined by

$$f_w(x, y) = \ell(x, y, w),$$

and consider the class of functions

$$F_W = \{ f_w \mid w \in W \} . \tag{24}$$

Given $S = ((x_1, y_1), \ldots, (x_m, y_m))$, where $x_i \in [0,1]^d$ and $y_i \in \{\pm 1\}$, we say that $(F_W, S)$ is
$(\infty, \epsilon)$-properly-covered by a set $V \subseteq F_W$ if for any $f \in F_W$ there is a $g \in V$ such that

$$\|(f(x_1, y_1), \ldots, f(x_m, y_m)) - (g(x_1, y_1), \ldots, g(x_m, y_m))\|_\infty \le \epsilon .$$

We denote by $N_\infty(W, S, \epsilon)$ the minimum value of an integer $N$ such that there exists a $V \subseteq F_W$ of size
$N$ that $(\infty, \epsilon)$-properly-covers $(F_W, S)$.

The following lemma bounds the covering number for $F_W$, for sets $S$ with all-positive labels $y_i$.

Lemma 13 Let $S = ((x_1, 1), \ldots, (x_m, 1))$, where $x_i \in [0,1]^d$, and let $F_W$ be as defined in
Eq. (24). Then,

$$\ln N_\infty(W_k, S, \epsilon) \le C \cdot rk\ln(kd)\ln(m)/\epsilon^2 .$$

Proof We use a technique in the spirit of the one given in Zhang (2002). Fix some $u$, with $u \ge 0$
and $\|u\|_1 \le k$. For each $i$ let

$$g_i^u(w) = \begin{cases} |\langle w, x_i\rangle - \langle u, x_i\rangle| & \text{if } \langle u, x_i\rangle \le r, \\ [r - \langle w, x_i\rangle]_+ & \text{o.w.} \end{cases}$$

and define the function

$$G_u(w) = \max_i g_i^u(w) .$$

It is easy to verify that for any $w \ge 0$,

$$\|(f_w(x_1, 1), \ldots, f_w(x_m, 1)) - (f_u(x_1, 1), \ldots, f_u(x_m, 1))\|_\infty \le G_u(w) .$$

Now, clearly, $G_u(u) = 0$. In addition, for any $w \ge 0$, a sub-gradient of $G_u$ at $w$ is obtained
by choosing $i$ that maximizes $g_i^u(w)$ and then taking a sub-gradient of $g_i^u$, which is of the form
$z = a x_i$ where $a \in \{-1, 0, 1\}$. If $a \in \{-1, 1\}$, it is easy to verify that

$$\sum_j w[j]\, z[j]^2 \le \langle w, x_i\rangle \le g_i^u(w) + r = G_u(w) + r .$$

If $a = 0$ then clearly $\sum_j w[j]\, z[j]^2 = 0 \le G_u(w) + r$ as well.

We can now apply Cor. 12 by setting $f_t = G_u$ for all $t$, setting $\alpha = 1$ and $\beta = r$ in Eq. (10),
and noting that since $x_i \in [0,1]^d$, we have $z_t \in [-1, 1]^d$ for all $t$. If $\eta \le 1$ we have $\eta z_t[i] \ge -1$ for
all $t, i$ as needed. Since $\eta = \sqrt{\frac{2k\ln(kd)}{rT}}$, this holds for all $T \ge 2k\ln(kd)/r$.

We conclude that if we run the unnormalized EG algorithm with $T \ge 2k\ln(kd)/r$ and $\eta$ and $\lambda$
as required, we get

$$\sum_{t=1}^T G_u(w_t) \le C \cdot \sqrt{rk\ln(kd)\, T} .$$

Dividing by $T$ and using Jensen's inequality we conclude

$$G_u\left(\frac{1}{T}\sum_{t=1}^T w_t\right) \le C \cdot \sqrt{\frac{rk\ln(kd)}{T}} .$$

Denote $w_u = \frac{1}{T}\sum_{t=1}^T w_t$. Setting $\epsilon = C \cdot \sqrt{\frac{rk\ln(kd)}{T}}$, it follows that the following set is an
$(\infty, \epsilon)$-proper-cover for $(F_{W_k}, S)$:

$$V = \{ f_{w_u} \mid u \in W_k \} .$$

Now, we only have left to bound the size of $V$. Consider again the unnormalized EG algorithm.
Since $z_t = a x_i$ for some $a \in \{-1, 0, +1\}$ and $i \in \{1, \ldots, m\}$, at each round of the algorithm
there are only two choices to be made: the value of $i$ and the value of $a$. Therefore, the number of
different vectors produced by running unnormalized EG for $T$ iterations on $G_u$ for different values
of $u$ is at most $(3m)^T$. Thus $|V| \le (3m)^T$. By our definition of $\epsilon$,

$$\ln |V| \le T\ln(3m) \le C \cdot rk\ln(kd)\ln(m)/\epsilon^2 .$$

This concludes our proof. ■

Using this result we can bound from above the covering number defined using the Euclidean
norm: We say that $(F_W, S)$ is $(2, \epsilon)$-properly-covered by a set $V \subseteq F_W$ if for any $f \in F_W$ there is
a $g \in V$ such that

$$\frac{1}{\sqrt{m}}\,\|(f(x_1, y_1), \ldots, f(x_m, y_m)) - (g(x_1, y_1), \ldots, g(x_m, y_m))\|_2 \le \epsilon .$$

We denote by $N_2(W, S, \epsilon)$ the minimum value of an integer $N$ such that there exists a $V \subseteq F_W$ of
size $N$ that $(2, \epsilon)$-properly-covers $(F_W, S)$. It is easy to see that for any two vectors $u, v \in \mathbb{R}^m$,
$\frac{1}{\sqrt{m}}\|u - v\|_2 \le \|u - v\|_\infty$. It follows that for any $W$ and $S$, we have $N_2(W, S, \epsilon) \le N_\infty(W, S, \epsilon)$.

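The $N_2$ covering number can be used to bound the Rademacher complexity of $(F_W, S)$ using
a refinement of Dudley's entropy bound (Dudley, 1967), which is stated the most explicitly in
Srebro et al. (2010) (Lemma A.3). The lemma states that for any $\epsilon \ge 0$,

$$\mathcal{R}(W, S) \le C\left(\epsilon + \frac{1}{\sqrt{m}}\int_\epsilon^B \sqrt{\ln N_2(W, S, \gamma)}\, d\gamma\right),$$

where $B$ is an upper bound on the possible values of $f \in F_W$ on members of $S$. For $S$ with
all-positive labels we clearly have $B \le r$.
Combining this with Lemma 13 we get

$$\mathcal{R}(W_k, S) \le C\left(\epsilon + \sqrt{\frac{rk\ln(kd)\ln(m)}{m}}\int_\epsilon^r \frac{1}{\gamma}\, d\gamma\right) .$$

Setting $\epsilon = r/m$ we get

$$\mathcal{R}(W_k, S) \le C\sqrt{\frac{rk\ln(ekd)\ln^3(m)}{m}} .$$

Thus, for any $k, d, m \ge 1$, and any distribution $D$ over $[0,1]^d \times \{\pm 1\}$ that draws only positive
labels, we have

$$\mathcal{R}_m(W_k, D) \le C\sqrt{\frac{rk\ln(ekd)\ln^3(m)}{m}} .$$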



By Rademacher sample complexity bounds (Bartlett and Mendelson, 2002), and since $\ell$ for positive
labels is bounded by $r$, we can immediately conclude the following:

Theorem 14 Let $k \ge r \ge 0$. For any distribution $D$ over $[0,1]^d \times \{\pm 1\}$ that draws only positive
labels, with probability $1 - \delta$ over samples of size $m$, for any $w \in W_k$,

$$\ell_+(w) \le \hat\ell_+(w) + C\sqrt{\frac{rk(\ln(ekd)\ln^3(m) + \ln(1/\delta))}{m}}.$$
4.3 Combining negative and positive losses

We have shown separate convergence rate results for the loss on positive labels and for the loss on 
negative labels. In this section we combine these results to achieve a convergence result for the 
full Winnow loss. For this, we need to adapt the convergence results achieved above to take into 
account the fraction of positive and negative labels in the true distribution as well as in the sample. 
The following theorems accomplish this for the negative and the positive cases. 

Theorem 15 There exists a universal constant $C$ such that for any distribution $D$ over $[0,1]^d \times \{\pm 1\}$,
with probability $1 - \delta$ over samples of size $m$



p+l+{w)<p+t+{w) + C^ 



Proof First, if p^ < i6in(i/^) ^j^^j^ ^j^g theorem trivially holds. Therefore we assume that > 
i^MlZ^) . We have 

m 

$$p_+\ell_+(w) = \hat p_+\hat\ell_+(w) + (p_+ - \hat p_+)\hat\ell_+(w) + p_+(\ell_+(w) - \hat\ell_+(w)). \tag{25}$$

To prove the theorem, we will bound the two rightmost terms. First, to bound $(p_+ - \hat p_+)\hat\ell_+(w)$, note
that by definition of the loss function for positive labels we have that $\hat\ell_+(w) \in [0, r]$. Therefore,
Bernstein's inequality (Eq. (8)) implies that with probability $1 - \delta/3$



Wk{\x).{kd) ln3(m) + ln(3/(5)) 
m 



^ /ln(3/(5) , ln(3/5), /4rln(3/5) 

V m m \ m 

Second, to bound $p_+(\ell_+(w) - \hat\ell_+(w))$, we apply Theorem 14 to the conditional distribution
induced by $D$ on $X$ given $Y = 1$, to get that with probability $1 - \delta/3$

$$p_+(\ell_+(w) - \hat\ell_+(w)) \le p_+ \cdot C \cdot \sqrt{\frac{rk(\ln(ekd)\ln^3(m) + \ln(3/\delta))}{m\hat p_+}}.$$

Using our assumption on $p_+$ we obtain from Lemma 1 that with probability $1 - \delta/3$, $p_+/\hat p_+ \le 2$.
Therefore, $p_+/\sqrt{\hat p_+} \le \sqrt{2p_+} \le \sqrt{2}$. Thus, with probability $1 - 2\delta/3$,



V m 



Combining Eq. (25), Eq. (26) and Eq. (27) and applying the union bound, we get the theorem.



Theorem 16 There exists a universal constant $C$ such that for any distribution $D$ over $[0,1]^d \times \{\pm 1\}$,
with probability $1 - \delta$ over samples of size $m$



p.iA^) < paA^) + C \ J ^^Hedm/6) ^ kHedm/6) ^^^^ 
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Proof First, if p_ < ^^^^^^^^ then the theorem trivially holds (since i- {w) G [0, r + k]). Therefore 

we assume that p_ > Thus, by Lemma[Il p_ > p-/2. 

We have

$$p_-\ell_-(w) = \hat p_-\hat\ell_-(w) + (p_- - \hat p_-)\hat\ell_-(w) + p_-(\ell_-(w) - \hat\ell_-(w)). \tag{29}$$

To prove the theorem, we will bound the two rightmost terms. First, to bound $(p_- - \hat p_-)\hat\ell_-(w)$,
note that by Bernstein's inequality and our assumption on $p_-$, with probability $1 - \delta$



m m \ m 



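By Lemma 6 and Lemma 1, $\hat\ell_-(w) \le \frac{r}{\hat p_-} \le \frac{2r}{p_-}$. In addition, by definition $\hat\ell_-(w) \le r + k \le 2k$.
Therefore

$$(p_- - \hat p_-)\hat\ell_-(w) \le 4\min\left(\frac{2r}{p_-}, k\right)\sqrt{\frac{p_-\ln(1/\delta)}{m}}. \tag{30}$$

Now, if $k \ge 2r/p_-$, then the right-hand side of the above becomes

$$8\,\frac{r}{p_-}\sqrt{\frac{p_-\ln(1/\delta)}{m}} = 8\sqrt{\frac{(r/p_-) \cdot r\ln(1/\delta)}{m}} \le 8\sqrt{\frac{k \cdot r\ln(1/\delta)}{m}}.$$

Otherwise, $k \le 2r/p_-$ and the right-hand side of Eq. (30) becomes

$$4k\sqrt{\frac{p_-\ln(1/\delta)}{m}} \le 4k\sqrt{\frac{(2r/k)\ln(1/\delta)}{m}} \le 8\sqrt{\frac{k \cdot r\ln(1/\delta)}{m}}.$$

All in all, we have shown that

$$(p_- - \hat p_-)\hat\ell_-(w) \le 8\sqrt{\frac{kr\ln(1/\delta)}{m}}. \tag{31}$$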
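The two-case argument above can be sanity-checked numerically. The sketch below is illustrative only: the function names `eq30_rhs` and `eq31_rhs` are hypothetical, and the constants 4 and 8 are taken from the case analysis as reconstructed here, so treat it as a check of the algebra rather than of the paper's exact statement.

```python
import math

def eq30_rhs(r, k, p_minus, m, delta):
    """Right-hand side of Eq. (30): 4 * min(2r/p_-, k) * sqrt(p_- ln(1/delta) / m)."""
    return 4 * min(2 * r / p_minus, k) * math.sqrt(p_minus * math.log(1 / delta) / m)

def eq31_rhs(r, k, m, delta):
    """Right-hand side of Eq. (31): 8 * sqrt(k r ln(1/delta) / m)."""
    return 8 * math.sqrt(k * r * math.log(1 / delta) / m)

def bound_holds(r, k, p_minus, m, delta):
    """True if the Eq. (30) bound is dominated by the Eq. (31) bound (small float slack)."""
    return eq30_rhs(r, k, p_minus, m, delta) <= eq31_rhs(r, k, m, delta) * (1 + 1e-12)
```

Both branches of the minimum are exercised by varying `p_minus`: a small `p_minus` makes `k` the smaller argument, a large one makes `2r/p_minus` the smaller argument.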

Second, to bound p-{£- (w) — (w)), recall that by Theorem [H we have 

w Giw^Rll \\w\\i < k,ED[{w,X) I y = -1] < b}, 

where b is defined as 

^_ 8r ^32k\n{d/6) ^ C ^^^kln{d/6) 
p^ mp- ~ P- m 

Thus, by Cor. 11, with probability $1 - \delta$

$$\ell_-(w) \le \hat\ell_-(w) + C\left(\sqrt{\frac{(kb + r)\ln(edm\hat p_-/\delta)}{m\hat p_-}} + \frac{k\ln(edm\hat p_-/\delta)}{m\hat p_-}\right).$$

Since $\hat p_- \ge p_-/2$,

$$\ell_-(w) \le \hat\ell_-(w) + C\left(\sqrt{\frac{(kb + r)\ln(edm/\delta)}{mp_-}} + \frac{k\ln(edm/\delta)}{mp_-}\right)$$

for some other constant $C$. Therefore, substituting $b$ for its upper bound we get



m m 



Combining Eq. (29), Eq. (31) and Eq. (32) we get the statement of the theorem. ■

Finally, we are ready to prove our main result for the sample complexity of ERM algorithms for
Winnow.

Proof [of Theorem 5] From Theorem 15 and Theorem 16 we conclude that with probability $1 - \delta$,



< P-L(^) +p,f-,(^.) + . 0('-Mln(M) lu^(m) + Hm)) ,33, 

V m 

Now,

$$\hat p_-\hat\ell_-(w) + \hat p_+\hat\ell_+(w) = \hat\ell(w) \le \hat\ell(w^*). \tag{34}$$

We have $E[\ell(X,Y,w^*)] = \ell(w^*) \le \ell(0) \le r$. Therefore, by Bernstein's inequality we have that
with probability $1 - \delta$



iiw*) = E[i{X,Y,w*)] < E[i{X,Y,w*)] + d^-^^^^max{E[i{X,Y,w*)],^-^^^^} 

m m 



\ m m 
Combining this with Eq. (34) we get that with probability $1 - \delta$



V m m 

In light of Eq. (33), we conclude Eq. (11). ■



5. Lower Bounds 

In this section we provide lower bounds for the learning rate and for the uniform convergence rate
of the Winnow loss $\ell_\theta$.



5.1 Learning rate lower bound 

Fix a threshold $\theta$. The best Winnow loss for a distribution $D$ over $[0,1]^d \times \{\pm 1\}$ using a hyperplane
from a set $W \subseteq \mathbb{R}_+^d$ is denoted by $\ell_\theta(W) = \min_{w \in W} \ell_\theta(w)$. The following result shows that even
if the data domain is restricted to the discrete domain $\{0,1\}^d$, the number of samples required for
learning with the Winnow loss grows at least linearly in $\theta k$.

Theorem 17 Let $k \ge 1$ and let $\theta \in [1, k/2]$. The sample complexity of learning with respect to
the loss $\ell_\theta$ is $\Omega(\theta k/\epsilon^2)$. That is, for all $\epsilon \in (0, 1/2)$, if the training set size is $m = o(\theta k/\epsilon^2)$, then
for any learning algorithm, there exists a distribution such that the classifier $h : \{0,1\}^d \to \mathbb{R}_+$
that the algorithm outputs upon receiving $m$ i.i.d. examples satisfies $\ell_\theta(h) - \ell_\theta(W_k) > \epsilon$ with
probability at least $1/4$.

In the following construction we use the notion of a Hadamard matrix. A Hadamard matrix of
order $n$ is an $n \times n$ matrix $H_n$ with entries in $\{\pm 1\}$ such that $H_n H_n^T = nI_n$. In other words,
all rows in the matrix are orthogonal to each other. Hadamard matrices exist at least for each $n$
which is a power of 2 (Sylvester, 1867).
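For concreteness, here is a minimal sketch of Sylvester's doubling construction, which builds a Hadamard matrix of order $2n$ from one of order $n$ as $[[H, H], [H, -H]]$. The function names are illustrative and not taken from the paper; the code only checks the defining property $H_n H_n^T = nI_n$.

```python
def sylvester_hadamard(n):
    """Return the n x n Sylvester Hadamard matrix as a list of rows (n a power of 2)."""
    if n < 1 or n & (n - 1) != 0:
        raise ValueError("n must be a power of 2")
    H = [[1]]
    while len(H) < n:
        # Doubling step: [[H, H], [H, -H]]
        H = [row + row for row in H] + [row + [-v for v in row] for row in H]
    return H

def is_hadamard(H):
    """Check that all entries are +-1 and that H H^T = n I."""
    n = len(H)
    if any(v not in (-1, 1) for row in H for v in row):
        return False
    for i in range(n):
        for j in range(n):
            inner = sum(H[i][t] * H[j][t] for t in range(n))
            if inner != (n if i == j else 0):
                return False
    return True
```

For example, `sylvester_hadamard(2)` gives the rows `[1, 1]` and `[1, -1]`, which are orthogonal and each have squared norm 2.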

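Lemma 18 Assume $k$ is a power of 2, and let $d = k^2$. Let $x_1, \dots, x_d \in \{\pm 1\}^d$ be the rows of the
Hadamard matrix of order $d$. For every $y \in \{\pm 1\}^d$, there exists a $w \in W = \{w \in [-1,1]^d \mid \|w\|_1 \le k\}$
such that for all $i \in [d]$, $y[i]\langle w, x_i\rangle = 1$.

Proof By the definition of a Hadamard matrix, for all $i \ne j$, $\langle x_i, x_j\rangle = 0$. Given $y \in \{\pm 1\}^d$, set
$w = \frac{1}{d}\sum_{j \in [d]} y_j x_j$. Then for each $i$,

$$y_i\langle w, x_i\rangle = \frac{y_i}{d}\sum_{j \in [d]} y_j\langle x_i, x_j\rangle = \frac{1}{d}y_i^2\langle x_i, x_i\rangle = \frac{1}{d}\|x_i\|_2^2 = 1.$$

It is left to show that $w \in W$. First, for all $i \in [d]$, we have

$$|w[i]| = \frac{1}{d}\Big|\sum_{j \in [d]} y_j x_j[i]\Big| \le \frac{1}{d}\sum_{j \in [d]} |y_j x_j[i]| = 1,$$

which yields $w \in [-1,1]^d$. Second, using $\|w\|_1 \le \sqrt{d}\,\|w\|_2$ and

$$\|w\|_2^2 = \langle w, w\rangle = \frac{1}{d^2}\sum_{i,j \in [d]} \langle y_i x_i, y_j x_j\rangle = \frac{1}{d^2}\sum_{i \in [d]} y_i^2\langle x_i, x_i\rangle = \frac{1}{d^2}\sum_{i \in [d]} d = 1,$$

we obtain that $\|w\|_1 \le \sqrt{d} = k$. ■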
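The lemma can be checked exhaustively for small $k$. The sketch below is an illustration, not part of the paper: it rebuilds the Hadamard rows with the Sylvester construction (an assumption about which Hadamard matrix is used; any would do), forms $w = \frac{1}{d}\sum_j y_j x_j$ with exact rational arithmetic, and verifies the three claims for every labeling $y$. The helper names are hypothetical.

```python
from fractions import Fraction
from itertools import product

def sylvester_hadamard(n):
    """Rows of the n x n Sylvester Hadamard matrix (n a power of 2)."""
    H = [[1]]
    while len(H) < n:
        H = [row + row for row in H] + [row + [-v for v in row] for row in H]
    return H

def lemma18_weights(y, rows):
    """w = (1/d) * sum_j y[j] * x_j, computed exactly with rationals."""
    d = len(rows)
    return [Fraction(sum(y[j] * rows[j][t] for j in range(d)), d) for t in range(d)]

def check_lemma18(k):
    """Verify the lemma's claims for d = k^2 over all 2^d labelings."""
    d = k * k
    rows = sylvester_hadamard(d)
    for y in product((-1, 1), repeat=d):
        w = lemma18_weights(y, rows)
        for i in range(d):
            inner = sum(w[t] * rows[i][t] for t in range(d))
            if y[i] * inner != 1:            # y[i] <w, x_i> = 1
                return False
        if any(abs(v) > 1 for v in w):       # w lies in [-1, 1]^d
            return False
        if sum(abs(v) for v in w) > k:       # ||w||_1 <= k
            return False
    return True
```

The exhaustive loop is only practical for $k \le 2$ (for $k = 4$ there are $2^{16}$ labelings of 16 points), which is enough to see the construction at work.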



Lemma 19 Let $k$ be a power of 2 and let $d = 2k^2 + 1$. There is a set $\{x_1, \dots, x_{k^2}\} \subseteq \{0,1\}^d$ such
that for every $y \in \{\pm 1\}^{k^2}$, there exists a $w \in W_k$ such that for all $i \in [k^2]$, $y[i](\langle w, x_i\rangle - k/2) = \frac{1}{2}$.

Proof From Lemma 18 we have that there is a set $\tilde X = \{\tilde x_1, \dots, \tilde x_{k^2}\} \subseteq \{\pm 1\}^{k^2}$ such that for
each labeling $y \in \{\pm 1\}^{k^2}$, there exists a $\tilde w_y \in [-1,1]^{k^2}$ with $\|\tilde w_y\|_1 \le k$ such that for all $i \in [k^2]$,
$y[i]\langle \tilde w_y, \tilde x_i\rangle = 1$. We now define a new set $X = \{x_1, \dots, x_{k^2}\} \subseteq \{0,1\}^d$ based on $\tilde X$ that satisfies
the requirements of the lemma.

For each $i \in [k^2]$ let $x_i = [\frac{\mathbf{1} + \tilde x_i}{2}, \frac{\mathbf{1} - \tilde x_i}{2}, 1]$, where $[\cdot, \cdot, \cdot]$ denotes a concatenation of vectors and $\mathbf{1}$
is the all-ones vector. In words, each of the first $k^2$ coordinates in $x_i$ is 1 if the corresponding
coordinate in $\tilde x_i$ is 1, and zero otherwise. Each of the next $k^2$ coordinates in $x_i$ is 1 if the corresponding
coordinate in $\tilde x_i$ is $-1$, and zero otherwise. The last coordinate in $x_i$ is always 1.

Now, let $y \in \{\pm 1\}^{k^2}$ be a desired labeling. We define $w_y$ based on $\tilde w_y$ as follows:
$w_y = [[\tilde w_y]_+, [-\tilde w_y]_+, \frac{k - \|\tilde w_y\|_1}{2}]$, where by $z = [v]_+$ we mean that $z[j] = \max\{v[j], 0\}$. In words, the
first $k^2$ coordinates of $w_y$ are copies of the positive coordinates of $\tilde w_y$, with zero in the negative
coordinates, and the next $k^2$ coordinates of $w_y$ are the absolute values of the negative coordinates
of $\tilde w_y$, with zero in the positive coordinates. The last coordinate is a scaling term.

We now show that $w_y$ has the desired property on $X$. For each $i \in [k^2]$

$$\langle w_y, x_i\rangle = \Big\langle \frac{\mathbf{1} + \tilde x_i}{2}, [\tilde w_y]_+\Big\rangle + \Big\langle \frac{\mathbf{1} - \tilde x_i}{2}, [-\tilde w_y]_+\Big\rangle + \frac{k - \|\tilde w_y\|_1}{2}$$
$$= \|\tilde w_y\|_1/2 + \langle \tilde x_i, \tilde w_y\rangle/2 + \frac{k - \|\tilde w_y\|_1}{2} = \langle \tilde x_i, \tilde w_y\rangle/2 + k/2 = y_i/2 + k/2.$$

It follows that $y_i(\langle w_y, x_i\rangle - k/2) = y_i^2/2 = 1/2$.
Now, clearly $w_y \in \mathbb{R}_+^d$. In addition,

$$\|w_y\|_1 = \|\tilde w_y\|_1 + \frac{k - \|\tilde w_y\|_1}{2} = \frac{\|\tilde w_y\|_1}{2} + k/2 \le k.$$

Hence $w_y \in W_k$ as desired. ■
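The reduction from $\{\pm 1\}$ instances to $\{0,1\}$ instances can be traced step by step in code. The sketch below is illustrative only: it assumes the Sylvester construction for the Hadamard rows and uses hypothetical helper names. It builds $x_i$ and $w_y$ exactly as in the proof and checks non-negativity, the norm bound, and the margin identity $y[i](\langle w_y, x_i\rangle - k/2) = 1/2$.

```python
from fractions import Fraction
from itertools import product

def sylvester_hadamard(n):
    """Rows of the n x n Sylvester Hadamard matrix (n a power of 2)."""
    H = [[1]]
    while len(H) < n:
        H = [row + row for row in H] + [row + [-v for v in row] for row in H]
    return H

def lemma19_construction(k, y):
    """Return (xs, w): binary instances x_i and the non-negative weight vector w_y."""
    d0 = k * k
    rows = sylvester_hadamard(d0)
    # Weights from the +-1 construction: w~ = (1/d0) * sum_j y[j] * x~_j
    w_tilde = [Fraction(sum(y[j] * rows[j][t] for j in range(d0)), d0) for t in range(d0)]
    l1 = sum(abs(v) for v in w_tilde)
    # x_i = [(1 + x~_i)/2, (1 - x~_i)/2, 1]
    xs = [[(1 + v) // 2 for v in r] + [(1 - v) // 2 for v in r] + [1] for r in rows]
    # w_y = [[w~]_+, [-w~]_+, (k - ||w~||_1)/2]
    w = ([max(v, Fraction(0)) for v in w_tilde]
         + [max(-v, Fraction(0)) for v in w_tilde]
         + [(k - l1) / 2])
    return xs, w

def check_lemma19(k):
    """Verify the lemma's claims for every labeling of the k^2 points."""
    for y in product((-1, 1), repeat=k * k):
        xs, w = lemma19_construction(k, y)
        if any(v < 0 for v in w) or sum(w) > k:      # w_y in W_k
            return False
        for i, x in enumerate(xs):
            inner = sum(a * b for a, b in zip(w, x))
            if y[i] * (inner - Fraction(k, 2)) != Fraction(1, 2):
                return False
    return True
```

Splitting $\tilde w_y$ into positive and negative parts is what makes the weights non-negative, and the extra always-on coordinate absorbs the slack $k - \|\tilde w_y\|_1$ so that every inner product is shifted by exactly $k/2$.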



Lemma 20 Let $z$ be a power of 2 and let $k$ be such that $z$ divides $k$. Let $d = 2kz + k/z$. There is a
set $\{x_1, \dots, x_{zk}\} \subseteq \{0,1\}^d$ such that for every $y \in \{\pm 1\}^{zk}$, there exists a $w \in W_k$ such that for
all $i \in [zk]$, $y[i](\langle w, x_i\rangle - z/2) = \frac{1}{2}$.

Proof By Lemma 19 there is a set $\tilde X = \{\tilde x_1, \dots, \tilde x_{z^2}\} \subseteq \{0,1\}^{2z^2+1}$ such that for all $y \in \{\pm 1\}^{z^2}$
there exists a $\tilde w_y \in \mathbb{R}_+^{2z^2+1}$ such that $\|\tilde w_y\|_1 \le z$ and for all $i \in [z^2]$, $y[i](\langle \tilde w_y, \tilde x_i\rangle - z/2) = \frac{1}{2}$.

We now construct a new set $X = \{x_1, \dots, x_{zk}\} \subseteq \{0,1\}^{2kz+k/z}$ as follows: For $i \in [zk]$, let
$n = \lfloor i/z^2\rfloor$ and $m = i \bmod z^2$, so that $i = nz^2 + m$. The vector $x_i$ is the concatenation of $\frac{zk}{z^2} = \frac{k}{z}$
vectors, each of which is of dimension $2z^2 + 1$, where all the vectors are the all-zeros vector, except
the $(n+1)$'th vector, which equals $\tilde x_{m+1}$. That is:

$$x_i = [\underbrace{0}_{\in \mathbb{R}^{2z^2+1}}, \dots, \underbrace{\tilde x_{m+1}}_{\text{block } n+1}, \dots, \underbrace{0}_{\in \mathbb{R}^{2z^2+1}}] \in \mathbb{R}^{2kz+k/z}.$$

Given $y \in \{\pm 1\}^{zk}$, let us rewrite it as a concatenation of $k/z$ vectors, each of which is in $\{\pm 1\}^{z^2}$,
namely,

$$y = [y(1), \dots, y(k/z)] \in \{\pm 1\}^{kz}.$$

Define $w_y$ as the concatenation of $k/z$ vectors in $\mathbb{R}_+^{2z^2+1}$, using $\tilde w_y$ defined above for each
$y \in \{\pm 1\}^{z^2}$, as follows:

$$w_y = [\tilde w_{y(1)}, \dots, \tilde w_{y(k/z)}] \in \mathbb{R}_+^{2kz+k/z}.$$

For each $i$ such that $n = \lfloor i/z^2\rfloor$ and $m = i \bmod z^2$, we have

$$\langle w_y, x_i\rangle - z/2 = \langle \tilde w_{y(n+1)}, \tilde x_{m+1}\rangle - z/2 = \tfrac{1}{2}\,y(n+1)[m+1].$$

Now $y(n+1)[m+1] = y[i]$, thus we get $y[i](\langle w_y, x_i\rangle - z/2) = \frac{1}{2}$ as desired. Finally, we observe
that $\|w_y\|_1 = \sum_{n \in [k/z]} \|\tilde w_{y(n)}\|_1 \le k/z \cdot z = k$, hence $w_y \in W_k$. ■

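Proof [of Theorem 17] Let $k \ge 1$, $\theta \in [1, k/2]$. Define $z = 2\theta$. Let $n = \max\{n \mid 2^n \le z\}$, and let
$m = \max\{m \mid m2^n \le k\}$. Define $\tilde z = 2^n$ and $\tilde k = m2^n$. We have that $\tilde z$ is a power of 2 and $\tilde z$
divides $\tilde k$. Let $\tilde d = 2\tilde k\tilde z + \tilde k/\tilde z$. By Lemma 20 there is a set $\tilde X = \{\tilde x_1, \dots, \tilde x_{\tilde z\tilde k}\} \subseteq \{0,1\}^{\tilde d}$ such that
for every $y \in \{\pm 1\}^{\tilde z\tilde k}$, there exists a $\tilde w_y \in W_{\tilde k}$ such that for all $i \in [\tilde z\tilde k]$, $y[i](\langle \tilde w_y, \tilde x_i\rangle - \tilde z/2) = \frac{1}{2}$.
Now, let $d = \tilde d + 1$, and define $w_y = [\tilde w_y, \frac{z - \tilde z}{2}]$ and $x_i = [\tilde x_i, 1]$. It follows that

$$y[i](\langle w_y, x_i\rangle - \theta) = y[i](\langle w_y, x_i\rangle - z/2)$$
$$= y[i](\langle \tilde w_y, \tilde x_i\rangle + z/2 - \tilde z/2 - z/2)$$
$$= y[i](\langle \tilde w_y, \tilde x_i\rangle - \tilde z/2) = \frac{1}{2}.$$

We conclude that for all $i \in [\tilde z\tilde k]$, $\ell_\theta(x_i, y[i], w_y) = 0$ and $\ell_\theta(x_i, 1 - y[i], w_y) = 1$. Moreover,

$$\mathrm{sign}(\langle w_y, x_i\rangle - \theta) = y[i].$$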

Now, for a given $w$ define $h_w(x) = \mathrm{sign}(\langle w, x\rangle - \theta)$, and consider the binary hypothesis class
$H = \{h_w \mid w \in W_k\}$ over the domain $X$. Our construction of $w_y$ shows that the set $X$ is shattered
by this hypothesis class, thus its VC dimension is at least $|X|$. By VC-dimension lower bounds (e.g.
Anthony and Bartlett 1999, Theorem 5.2), it follows that for any learning algorithm for $H$, if the
training set size is $o(|X|/\epsilon^2)$, then there exists a distribution over $X$ so that with probability greater
than $1/64$, the output $h$ of the algorithm satisfies

$$E[h(x) \ne y] > \min_{w \in W_k} E[h_w(x) \ne y] + \epsilon. \tag{35}$$

Next, we show that the existence of a learning algorithm for $W_k$ with respect to $\ell_\theta$ whose sample
complexity is $o(|X|/\epsilon^2)$ would contradict the above statement. Indeed, let $w^*$ be a minimizer of the
right-hand side of Eq. (35), and let $y^*$ be the vector of predictions of $w^*$ on $X$. As our construction
of $w_{y^*}$ shows, we have $\ell_\theta(w_{y^*}) = E[h_{w^*}(x) \ne y]$. Now, suppose that some algorithm learns
$w \in W_k$ so that $\ell_\theta(w) \le \ell_\theta(W_k) + \epsilon$. This implies that

$$\ell_\theta(w) \le \ell_\theta(w_{y^*}) + \epsilon = E[h_{w^*}(x) \ne y] + \epsilon.$$

In addition, define a (probabilistic) classifier, $h$, that outputs the label $+1$ with probability $p(w, x)$
where $p(w, x) = \min\{1, \max\{0, 1/2 + (\langle w, x\rangle - \theta)\}\}$. Then, it is easy to verify that

$$P[h(x) \ne y] \le \ell_\theta(x, y, w).$$

Therefore, $E[h(x) \ne y] \le \ell_\theta(w)$, and we obtain that

$$E[h(x) \ne y] \le E[h_{w^*}(x) \ne y] + \epsilon,$$

which leads to the desired contradiction. ■
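The "easy to verify" step for the probabilistic classifier can be checked numerically. The sketch below rests on an assumption that is not restated in this section: that the Winnow loss has the hinge form $\ell_\theta(x, y, w) = [1/2 - y(\langle w, x\rangle - \theta)]_+$, which is consistent with the values 0 and 1 obtained above at margin $1/2$. The function names are illustrative, and `s` stands for $\langle w, x\rangle - \theta$.

```python
def winnow_loss(s, y):
    """Assumed hinge form of the loss: [1/2 - y * s]_+, where s = <w, x> - theta."""
    return max(0.0, 0.5 - y * s)

def prob_plus(s):
    """p(w, x) = min{1, max{0, 1/2 + s}}: probability that h outputs +1."""
    return min(1.0, max(0.0, 0.5 + s))

def error_prob(s, y):
    """Probability that the randomized classifier h disagrees with the label y."""
    p = prob_plus(s)
    return 1.0 - p if y == 1 else p

def error_bounded_by_loss(s, y, tol=1e-12):
    """True if P[h(x) != y] is at most the (assumed) loss, up to float tolerance."""
    return error_prob(s, y) <= winnow_loss(s, y) + tol
```

Under this assumed form, the error probability equals the loss whenever the loss is at most 1 and is capped at 1 otherwise, which is exactly the inequality used in the proof.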



We next show that the uniform convergence rate for our problem is in fact slower than the 
achievable learning rate. 
5.2 Uniform convergence lower bound

The next theorem shows that the rate of uniform convergence for our problem is too slow, even if 
the distribution draws only negative labels. 

Theorem 21 Let k > 1, and assume < k/2. There exists a distribution D over {0,1}^ +1 X Y 
such thatyx G {0,1}^, P[y = -1 \ X = x] = 1, andt{Wk,D) 
probability at least 1/2 over samples S ~ D^, 



$$\exists w \in W_k, \quad |\ell(w, S) - \ell(w, D)| \ge \Omega(\sqrt{k^2/m}). \tag{36}$$

To prove this theorem we first show two useful lemmas. The first lemma shows that a lower
bound for the Rademacher complexity of a function class implies a lower bound on the uniform
convergence of this function class. The derivation is similar to the proof of the upper bound in
Bartlett and Mendelson (2002).



Lemma 22 Let $Z$ be a set, and consider a function class $F \subseteq [0,1]^Z$. Let $D$ be a distribution over
$Z$. If $\mathcal{R}_m(F, D) \ge \alpha$, then with probability at least $1 - \delta$ over samples $S \sim D^m$,



3f G F, \Ex^s[f{X)] - Ex^D[f{X)]\ > a/2 - 

Proof Denote $E[f, S] = E_{X \sim S}[f(X)]$, and $E[f, D] = E_{X \sim D}[f(X)]$. Consider two independent
samples $S = (X_1, \dots, X_m)$, $S' = (X'_1, \dots, X'_m) \sim D^m$, and let $\sigma = (\sigma_1, \dots, \sigma_m)$ be $m$
independent random variables drawn uniformly from $\{\pm 1\}$. We have

2E5[sup \E[f, S] - E[f, D]\] > Es,5'[sup \E[f, S] - E[f, D]\ + sup \E[f, S'] - E[f, D]\] 

f&F f&F f&F 

> Es,5'[sup \E[f, S] - E[f, D]\ + \E[f, S'] - E[f, D]\] 
>EsM^^P\E[f,S]-E[f,S']\] 



-E5,5'[sup| f{X,)-f{Xl) 
-E5,5'[sup| V f{X,)-f{Xl) 



^ f^F 



= -E,,5[sup \aif{Xi)\] = TZm{F, D). 
m f(zF 

Thus by the assumed lower bound on the Rademacher complexity,

$$E_S[\sup_{f \in F} |E[f, S] - E[f, D]|] \ge \alpha/2.$$

It remains to show a lower bound with high probability. Define $g(S) = \sup_{f \in F} |E[f, S] - E[f, D]|$.
Any change of one element in $S$ can cause $g(S)$ to change by at most $1/m$. Therefore, by
McDiarmid's inequality, $P[g(S) \le E[g(S)] - t] \le \exp(-2mt^2)$. It follows that with probability at
least $1 - \delta$,

suv\E[f,S]-E[f,D]\>a/2- ■ 1^^^ 



feF 



8m 
The next lemma provides a uniform convergence lower bound for a universal class of binary
functions.

Lemma 23 There exist universal constants $c, C, C'$ such that the following holds. Let $H = \{0,1\}^{[n]}$
be the set of all binary functions on $[n]$. Let $D$ be the uniform distribution over $[n]$. For any $n \ge C'$, with
probability at least $\frac{1}{2}$ over i.i.d. samples of size $m$ drawn from $D$,

$$\exists h \in H, \quad |E_{X \sim S}[h(X)] - E_{X \sim D}[h(X)]| \ge \max\{C \cdot \sqrt{n/m},\ c\}.$$

Proof Denote $E[h, S] = E_{X \sim S}[h(X)]$, and $E[h, D] = E_{X \sim D}[h(X)]$.

First, consider the case $m/n < 8$. For a given sample $S$ define $h_S \in H$ such that

$$h_S(j) = \mathbb{I}[j \text{ appears in } S],$$

and denote by $N(S)$ the number of elements from $[n]$ that do not appear in $S$. Then

$$E[h_S, D] = \frac{1}{n}\sum_{j \in [n]} h_S(j) = \frac{1}{n}\sum_{j \in [n]} \mathbb{I}[j \text{ appears in } S] = 1 - \frac{N(S)}{n}.$$

On the other hand, $E[h_S, S] = 1$. It follows that

$$|E[h_S, S] - E[h_S, D]| \ge N(S)/n.$$

Using the fact that $1 - x \ge \exp(-2x)$ for $x \le 1/2$, we get that for $n > 1$,

$$E_S[N(S)] = \sum_{j \in [n]} P[j \text{ does not appear in } S] = n\left(1 - \frac{1}{n}\right)^m \ge n\exp(-2m/n) \ge n\exp(-16).$$

It follows that $E_S[E[h_S, S] - E[h_S, D]] \ge \exp(-16)$.
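The expected number of unseen elements can be checked directly on tiny cases. The sketch below is an illustration with hypothetical function names: it enumerates all $n^m$ samples to confirm the closed form $n(1 - 1/n)^m$, and compares that closed form against the exponential lower bound $n\exp(-2m/n)$ used above.

```python
import math
from fractions import Fraction
from itertools import product

def expected_missing_exact(n, m):
    """Average number of elements of [n] absent from a sample, over all n^m samples."""
    total = sum(n - len(set(sample)) for sample in product(range(n), repeat=m))
    return Fraction(total, n ** m)

def expected_missing_formula(n, m):
    """Closed form n * (1 - 1/n)^m from linearity of expectation."""
    return n * Fraction(n - 1, n) ** m

def exp_lower_bound(n, m):
    """The lower bound n * exp(-2m/n), valid when 1/n <= 1/2."""
    return n * math.exp(-2 * m / n)
```

The enumeration is exponential in `m`, so it is only meant for very small inputs; the closed form and the bound can be compared for any sizes.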

To show that this difference is large with high probability over the choice of $S$, denote $f(S) =
E[h_S, S] - E[h_S, D]$. Any change of one element in $S$ can cause $f(S)$ to change by at most $1/n$.
Therefore, by McDiarmid's inequality, $P[f(S) \le E[f(S)] - t] \le \exp(-2n^2t^2/m)$. It follows that
with probability at least $1/2$,

$$f(S) \ge \exp(-16) - \sqrt{\frac{m\ln 2}{2n^2}} \ge \exp(-16) - \sqrt{\frac{4\ln 2}{n}},$$

where the last inequality follows from the assumption that $m/n < 8$. It follows that there are
constants $c, C' > 0$ such that for $n \ge C'$, with probability at least $1/2$, $E[h_S, S] - E[h_S, D] \ge c$.

Second, consider the case $m/n \ge 8$. By Lemma 22 it suffices to provide a lower bound for
$\mathcal{R}_m(H, D)$. Fix a sample $S = (x_1, \dots, x_m)$ drawn from $D$. We have

$$m\,\mathcal{R}(H, S) = E_\sigma\Big[\Big|\sup_{h \in H} \sum_{i=1}^{m} \sigma_i h(x_i)\Big|\Big].$$
For a given $\sigma \in \{\pm 1\}^m$, define $h_\sigma \in H$ such that $h_\sigma(j) = \mathrm{sign}(\sum_{i : x_i = j} \sigma_i)$. Then



ie [m] 

= E.[|^ ^ a,Kij)\] 

= E ^'^[i E 

Now, let $c_j(S) = |\{i : x_i = j\}|$. The expression $E_\sigma[|\sum_{i : x_i = j} \sigma_i|]$ is equal to the expected distance of
a random walk of length $c_j(S)$, which can be bounded from below by $\sqrt{c_j(S)/2}$ (Szarek, 1976).
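The random-walk bound can be verified exactly for moderate lengths. The sketch below is an illustration (the function name is hypothetical): it computes $E|\sigma_1 + \dots + \sigma_c|$ for i.i.d. uniform signs by summing over the binomial distribution of the number of $+1$ steps, so it can be compared with $\sqrt{c/2}$.

```python
from math import comb, sqrt

def expected_abs_walk(c):
    """Exact E|sigma_1 + ... + sigma_c| for c i.i.d. uniform +-1 steps."""
    # With j steps equal to +1, the walk ends at position 2j - c.
    return sum(comb(c, j) * abs(2 * j - c) for j in range(c + 1)) / 2 ** c

def szarek_bound_holds(c, tol=1e-12):
    """True if the exact expectation is at least sqrt(c / 2), up to float tolerance."""
    return expected_abs_walk(c) >= sqrt(c / 2) - tol
```

The bound is tight at $c = 2$, where the expected distance is exactly 1. For large $c$ the expectation behaves like $\sqrt{2c/\pi}$, which is comfortably above $\sqrt{c/2}$.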
Therefore, 



Taking expectation over samples 5 drawn from D, we get 



V2m ^, , 



(37) 



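Our final step is to bound $E_S[\sqrt{c_j(S)}]$. We have $E_S[c_j(S)] = \frac{m}{n}$, and $\mathrm{Var}_S[c_j(S)] = \frac{m}{n}(1 - \frac{1}{n})$.
Thus, by Chebyshev's inequality,

$$P\Big[c_j(S) \le \frac{m}{n} - t\Big] \le \frac{m(1 - 1/n)}{nt^2} \le \frac{m}{nt^2}.$$

Therefore

$$E_S\Big[\sqrt{c_j(S)}\Big] \ge \Big(1 - \frac{m}{nt^2}\Big)\sqrt{\frac{m}{n} - t}.$$

Setting $t = \sqrt{2m/n}$, we get

$$E_S\Big[\sqrt{c_j(S)}\Big] \ge \frac{1}{2}\sqrt{\frac{m}{n} - \sqrt{\frac{2m}{n}}}.$$

Now, since $m/n \ge 8$ it is easy to check that $E_S[\sqrt{c_j(S)}] \ge \sqrt{m/8n}$. Plugging this into Eq. (37),
we get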



UiH, D) > V = i,/^. 



By Lemma 22 it follows that with probability at least $1 - \delta$ over samples,
Fixing $\delta = 1/2$, we get the desired lower bound. ■

Using the two lemmas above, we are now ready to prove our uniform convergence lower bound.
We do this by mapping a subset of $W_k$ to a universal class of binary functions over $\Theta(k^2)$ elements
from our domain. Note that for this lower bound it suffices to consider the more restricted domain
of binary vectors.

Proof [of Theorem 21] Let $q$ be the largest power of 2 such that $q \le k$. By Lemma 19 there exists
a set of vectors $Z = \{z_1, \dots, z_{q^2}\} \subseteq \{0,1\}^{2q^2+1}$ such that for every $t \in \{\pm 1\}^{q^2}$ there exists a
$w_t \in W_k$ such that for all $i$, $t[i](\langle w_t, z_i\rangle - q/2) = \frac{1}{2}$. Denote $U = \{w_t \mid t \in \{\pm 1\}^{q^2}\}$. It suffices
to prove a lower bound on the uniform convergence of $U$, since this implies the same lower bound
for $W_k$. Define the distribution $D$ over $Z \times \{\pm 1\}$ such that for $(X, Y) \sim D$, $X$ is drawn uniformly
from $z_1, \dots, z_{q^2}$ and $Y = -1$ with probability 1.

Consider the set of functions $H = \{0,1\}^Z$, and for $h \in H$ define $t_h \in \{\pm 1\}^{q^2}$ such that for all
$i \in [q^2]$, $t_h[i] = 2h(z_i) - 1$. For any $i \in [q^2]$, we have

$$\ell(z_i, -1, w_{t_h}) = [r' + \langle w_{t_h}, z_i\rangle]_+ = [r' + (t_h[i] + q)/2]_+ = [r' + (q-1)/2 + h(z_i)]_+ = r' + (q-1)/2 + h(z_i).$$

The last equality follows since $r' \ge -(q-1)/2$. It follows that for any $h \in H$ and any sample $S$ drawn
from $D$,

$$|\ell(w_{t_h}, S) - \ell(w_{t_h}, D)| = |E_{X \sim S}[h(X)] - E_{X \sim D}[h(X)]|.$$

By Lemma 23, with probability at least $\frac{1}{2}$ over the sample $S \sim D^m$,

$$\exists h \in H, \quad |E_{X \sim S}[h(X)] - E_{X \sim D}[h(X)]| \ge \Omega(\sqrt{q^2/m}) = \Omega(\sqrt{k^2/m}).$$

Thus, with probability at least $1/2$,

$$\exists w \in W_k, \quad |\ell(w, S) - \ell(w, D)| \ge \Omega(\sqrt{k^2/m}).$$